[00:00:04] AndyRussG and ejegg: Respected human, time to deploy CentralNotice update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170613T0000). Please do the needful.
[00:19:52] !log andyrussg@tin Started scap: Update CentralNotice
[00:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:40:44] !log andyrussg@tin Finished scap: Update CentralNotice (duration: 20m 51s)
[00:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:34] from the looks of it the actual deployment went ok, still some messages missing in the js-side of things, and still some verification steps to be done, but deployment seemed to go smoothly. Stepping afk, but still available if rollback needs to happen.
[01:02:26] thcipriani: ejegg: the code that goes out to all wikis seems fine. The test campaign on aa.wikibooks is good--go to https://aa.wikibooks.org/wiki/Main_Page , and keep reloading, you should see the banner twice, then no banner one time, then it should cycle
[01:02:30] Also mobile site looks fine
[01:02:34] Just gonna check logstash again
[01:02:45] great!
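The reload check described at [01:02:26] (banner twice, then no banner once, then repeat) can be pictured as a simple cycle. The following is only an illustrative Python sketch of that expected pattern, not CentralNotice's actual code; the names `SEQUENCE` and `banners_for_reloads` are hypothetical.

```python
from itertools import cycle, islice

# Hypothetical model of the test campaign's sequence: show the banner on two
# consecutive page views, show nothing on the third, then start over.
SEQUENCE = ["banner", "banner", None]


def banners_for_reloads(n):
    """Return what would be shown on each of n consecutive reloads."""
    return list(islice(cycle(SEQUENCE), n))


if __name__ == "__main__":
    for i, shown in enumerate(banners_for_reloads(6), start=1):
        print(f"reload {i}: {shown or 'no banner'}")
```

Comparing a handful of real reloads against this list is one way to confirm that the sequence wraps around as intended.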
[01:02:51] However, the i18n messages for the admin interface are still not there :(
[01:03:02] https://meta.wikimedia.org/w/index.php?title=Special:CentralNotice&subaction=noticeDetail&notice=test
[01:03:06] (you need permissions)
[01:03:12] ejegg: thcipriani: ^
[01:04:29] Hmmm RoanKattouw Krinkle ^
[01:04:51] (Just did a full scap, but new i18n messages not making it to JS apparently)
[01:04:53] hrm, resourceloader l10n stuff is not something I'm super familiar with
[01:05:12] lemme try w/ debug
[01:05:34] https://en.wikipedia.org/wiki/MediaWiki:Centralnotice-banner-sequence-days
[01:05:43] message seems like it's "there"
[01:06:01] Hmm no dice w/ debug=true either
[01:06:13] Yeah
[01:06:15] same for me
[01:06:43] Maybe we have to re-jiggle that RL module so it creates a new hash or something
[01:06:46] yeah, the mixin name and description are there
[01:06:47] lemme check msg_resource table
[01:07:03] ejegg: yea those are inserted via PHP
[01:07:38] I think we have to trick RL to thinking that the module that loads those has changed
[01:08:06] hm, that seems inadequate
[01:08:22] These are messages that go out with the ext.centralNotice.adminUi.bannerSequence module
[01:08:41] I think once before someone added an extraneous space somewhere when something similar happened
[01:08:59] Must be that RL built the module before the full scap had finished
[01:10:22] must have been something like that
[01:10:56] even with debug=true, the same request that pulls down info about the all-new bannerSequence js module also pulls down the incomplete message strings
[01:11:20] I can poke
[01:11:34] There's a magical thing you can do in eval.php with MessageBlobStore
[01:11:39] Which we should probably document somewhere
[01:11:59] AndyRussG: Where can I see this fail?
[01:12:23] RoanKattouw: https://meta.wikimedia.org/w/index.php?title=Special:CentralNotice&subaction=noticeDetail&notice=test
[01:12:39] vs https://meta.wikimedia.org/wiki/MediaWiki:Centralnotice-banner-sequence-days
[01:12:42] ⧼centralnotice-banner-sequence-detailed-help⧽ gotcha
[01:13:22] I have a vague memory of truncating msg_resource in the past but was not too keen on doing that again.
[01:13:30] RoanKattouw: I was thinking of maybe just adding an extraneous space in that RL module and scap-file... But if there's something better ... :)
[01:13:56] I just tried recaching it
[01:14:08] Like so:
[01:14:10] https://www.irccloud.com/pastebin/yaJULdcw/
[01:14:33] But I don't see it working yet
[01:15:21] I do see it working
[01:15:45] Oh cool
[01:15:47] I do also see it at https://meta.wikimedia.org/w/load.php?modules=ext.centralNotice.adminUi.bannerSequence now
[01:15:52] same here!
[01:15:57] RoanKattouw: yeah all great!
[01:16:00] http://tyler.zone/central-notice-help.png
[01:16:01] thx much!!!
[01:16:18] It's not playing ball for me yet
[01:16:24] But maybe it will eventually?
[01:16:37] thcipriani: that's what you're seeing at that URL?
[01:16:42] yes
[01:16:44] That should never happen...
[01:17:03] seeing at: https://meta.wikimedia.org/w/index.php?title=Special:CentralNotice&subaction=noticeDetail&notice=test
[01:17:17] oh good
[01:17:29] Why not? It looks exactly like what I see except with the messages filled in
[01:17:30] thcipriani: ah wait maybe it's cause you don't have permissions
[01:17:44] I'm not logged in
[01:18:53] thcipriani: RoanKattouw: OK hmm well that's a case we need to fix... How it looks to non-logged-in users. Not a message issue then
[01:18:59] Yeah I get the same in a private window
[01:19:27] Hah the message wfm in a private window now
[01:20:14] RoanKattouw: wfm?
[01:20:29] ah works for me
[01:20:32] cool
[01:20:49] Yeah log in, if you have CN rights you should see a bunch more controls
[01:20:57] ejegg: all the controls look OK logged in 4 u?
[01:21:12] yep, they did as soon as Roan did his thing
[01:23:37] cool!
[01:23:47] RoanKattouw: thcipriani: ejegg: thx so much!!!!!!
[01:24:24] glad everything is working, thanks for RL knowledge RoanKattouw
[01:24:28] thank you for shepherding this thing through!
[01:25:02] (ejegg: yeah really that whole part of the interface is not nice for people without CN rights, we should make a ticket for that... I guess we rarely hear from people w/ out rights looking at that page, but you should at least be able to see what the settings are...)
[01:29:55] !log OS install on labtestnet2002
[01:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:38] Operations, ops-codfw, Labs, Labs-Infrastructure: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3343358 (Papaul)
[01:51:40] Operations, ops-codfw, Labs, Labs-Infrastructure, netops: codfw: labtestpuppetmaster2001 switch port configuration - https://phabricator.wikimedia.org/T167321#3343355 (Papaul) Resolved>Open
[02:05:11] Operations, Technical-Debt: Supersede RT tickets references - https://phabricator.wikimedia.org/T165733#3343366 (Dzahn) RT #4566 - not available, 'domains' queue has never been imported to phab :( RT #4579 - not available, 'domains' queue has never been imported to phab :( RT #4581 - not available, 'doma...
[02:06:17] Operations: Puppet should set umask 0002 for newly created wikidev users - https://phabricator.wikimedia.org/T79400#862209 (Dzahn)
[02:07:30] PROBLEM - very high load average likely xfs on ms-be1019 is CRITICAL: CRITICAL - load average: 138.32, 101.10, 73.00
[02:13:30] RECOVERY - very high load average likely xfs on ms-be1019 is OK: OK - load average: 62.48, 78.55, 72.65
[02:25:42] Operations, ops-requests: Please create private transitionteam mailman list & wiki - https://phabricator.wikimedia.org/T82329#3343381 (Dzahn)
[02:25:42] Operations, ops-requests: Please create private IEGCom wiki - https://phabricator.wikimedia.org/T82498#3343385 (Dzahn)
[02:27:11] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.4) (duration: 08m 00s)
[02:27:20] Operations: Automatic periodic runs of "refreshLinks.php --dfn-only" on all wikis - https://phabricator.wikimedia.org/T80599#3343387 (Dzahn)
[02:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:27] Operations: teach nagios to check for swift storage processes - https://phabricator.wikimedia.org/T80722#3343393 (Dzahn)
[02:29:34] Operations: A couple of Bugzilla SQL requests for Community Metrics - https://phabricator.wikimedia.org/T81784#3343408 (Dzahn)
[02:33:23] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Jun 13 02:33:23 UTC 2017 (duration 6m 12s)
[02:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:39:41] (PS1) Dzahn: replace references to RT tickets with Phab ticket numbers [puppet] - https://gerrit.wikimedia.org/r/358500 (https://phabricator.wikimedia.org/T165733)
[03:16:40] (PS1) Dzahn: base/puppet: add "daemonize = no" to agent config [puppet] - https://gerrit.wikimedia.org/r/358501 (https://phabricator.wikimedia.org/T166371)
[03:18:00] (PS2) Dzahn: base/puppet: add "daemonize = no" to agent config [puppet] - https://gerrit.wikimedia.org/r/358501 (https://phabricator.wikimedia.org/T166371)
[03:29:24] (PS7) Dzahn: wikistats: add cron jobs for XML dumps [puppet] - https://gerrit.wikimedia.org/r/358150
[03:29:48] (PS8) Dzahn: wikistats: add cron jobs for XML dumps [puppet] - https://gerrit.wikimedia.org/r/358150
[03:44:51] (CR) Dzahn: [C: 2] wikistats: add cron jobs for XML dumps [puppet] - https://gerrit.wikimedia.org/r/358150 (owner: Dzahn)
[03:54:08] (PS1) Dzahn: wikistats: move dump dir out of define, duplicate declaration [puppet] - https://gerrit.wikimedia.org/r/358503
[03:54:48] (CR) Dzahn: [C: -1] gerrit: let Apache proxy only listen on service IP [puppet] - https://gerrit.wikimedia.org/r/354078 (owner: Dzahn)
[03:55:27] (CR) jerkins-bot: [V: -1] wikistats: move dump dir out of define, duplicate declaration [puppet] - https://gerrit.wikimedia.org/r/358503 (owner: Dzahn)
[03:55:29] (PS2) Dzahn: wikistats: move dump dir out of define, duplicate declaration [puppet] - https://gerrit.wikimedia.org/r/358503
[03:56:40] (CR) Dzahn: [C: 2] "labs-only and fixing puppet error. define is used with a hash, directory needs to move out of it." [puppet] - https://gerrit.wikimedia.org/r/358503 (owner: Dzahn)
[03:56:42] (CR) jerkins-bot: [V: -1] wikistats: move dump dir out of define, duplicate declaration [puppet] - https://gerrit.wikimedia.org/r/358503 (owner: Dzahn)
[03:57:42] (CR) Dzahn: "top-scope variable being used without an explicit namespace" [puppet] - https://gerrit.wikimedia.org/r/358503 (owner: Dzahn)
[03:59:29] (PS3) Dzahn: wikistats: move dump dir out of define, duplicate declaration [puppet] - https://gerrit.wikimedia.org/r/358503
[04:01:44] (CR) Dzahn: [C: 2] wikistats: move dump dir out of define, duplicate declaration [puppet] - https://gerrit.wikimedia.org/r/358503 (owner: Dzahn)
[04:01:51] (PS4) Dzahn: wikistats: move dump dir out of define, duplicate declaration [puppet] - https://gerrit.wikimedia.org/r/358503
[04:11:40] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1042.30 Read Requests/Sec=309.40 Write Requests/Sec=3.90 KBytes Read/Sec=39566.40 KBytes_Written/Sec=110.80
[04:16:35] (PS1) Dzahn: lists/icinga: remove mailman I/O stat CRITs [puppet] - https://gerrit.wikimedia.org/r/358504
[04:16:50] (CR) jerkins-bot: [V: -1] lists/icinga: remove mailman I/O stat CRITs [puppet] - https://gerrit.wikimedia.org/r/358504 (owner: Dzahn)
[04:17:37] (CR) Dzahn: [C: -1] "WIP - let's do something about "PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1042.30 Read Reque" [puppet] - https://gerrit.wikimedia.org/r/358504 (owner: Dzahn)
[04:21:40] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=139.60 Read Requests/Sec=262.20 Write Requests/Sec=1.00 KBytes Read/Sec=2361.60 KBytes_Written/Sec=254.40
[04:22:43] Operations, DBA, Performance-Team, Traffic, Wikidata: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3343495 (aaron) @daniel , can you look into the amount of purges happening in ChangeNotification jobs? I don't see an...
[04:25:20] PROBLEM - Nginx local proxy to apache on mw2218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:26:10] RECOVERY - Nginx local proxy to apache on mw2218 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.054 second response time
[04:39:21] PROBLEM - Nginx local proxy to apache on mw2115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:40:11] RECOVERY - Nginx local proxy to apache on mw2115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.060 second response time
[05:21:26] (CR) Faidon Liambotis: [C: 2] "Excellent!" [puppet] - https://gerrit.wikimedia.org/r/358501 (https://phabricator.wikimedia.org/T166371) (owner: Dzahn)
[05:21:31] (PS3) Faidon Liambotis: base/puppet: add "daemonize = no" to agent config [puppet] - https://gerrit.wikimedia.org/r/358501 (https://phabricator.wikimedia.org/T166371) (owner: Dzahn)
[05:31:16] (PS1) KartikMistry: apertium-spa: New upstream release [debs/contenttranslation/apertium-spa] - https://gerrit.wikimedia.org/r/358512 (https://phabricator.wikimedia.org/T167247)
[05:57:46] (PS1) KartikMistry: apertium-ita: New upstream release [debs/contenttranslation/apertium-ita] - https://gerrit.wikimedia.org/r/358515 (https://phabricator.wikimedia.org/T167247)
[06:19:41] (PS4) KartikMistry: apertium-cat: New upstream release [debs/contenttranslation/apertium-cat] - https://gerrit.wikimedia.org/r/358354 (https://phabricator.wikimedia.org/T167247)
[06:21:00] (PS1) KartikMistry: apertium-fra-cat: New upstream release [debs/contenttranslation/apertium-fra-cat] - https://gerrit.wikimedia.org/r/358517 (https://phabricator.wikimedia.org/T167247)
[06:21:40] (CR) jerkins-bot: [V: -1] apertium-fra-cat: New upstream release [debs/contenttranslation/apertium-fra-cat] - https://gerrit.wikimedia.org/r/358517 (https://phabricator.wikimedia.org/T167247) (owner: KartikMistry)
[06:26:00] RECOVERY - ores on scb2005 is OK: HTTP OK: HTTP/1.0 200 OK - 3666 bytes in 0.025 second response time
[06:38:36] (PS1) Marostegui: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/358518 (https://phabricator.wikimedia.org/T166935)
[06:40:32] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3343577 (Marostegui) p:Triage>Normal
[06:41:00] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/358518 (https://phabricator.wikimedia.org/T166935) (owner: Marostegui)
[06:42:19] (Merged) jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/358518 (https://phabricator.wikimedia.org/T166935) (owner: Marostegui)
[06:42:33] (CR) jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/358518 (https://phabricator.wikimedia.org/T166935) (owner: Marostegui)
[06:43:23] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1089 - T166935 (duration: 00m 42s)
[06:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:33] T166935: db1089: update RAID controller firwmare - https://phabricator.wikimedia.org/T166935
[06:43:48] !log Stop MySQL on db1089 to upgrade its raid controller firmware - T166935
[06:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:42] !log installing libtasn security updates
[06:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:59] !log executed "cumin 'mw2*.codfw.wmnet' 'find /var/log/hhvm/* -user root -exec chown www-data:www-data {} \;'" to fix the last occurences of wrong root:adm hhvm log occurrences
[06:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:25] (too many 'occurrences', just realized)
[06:57:27] * elukey fixes the log
[06:58:51] Operations, ops-eqiad, DBA, Patch-For-Review: db1089: update RAID controller firwmare - https://phabricator.wikimedia.org/T166935#3343584 (Marostegui) Open>Resolved a:Cmjohnson>Marostegui As Chris was having issues yesterday with the HP bundle, we decided that I would try to upgra...
[07:00:54] Operations, ops-codfw, media-storage: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756#3343589 (Marostegui) For the record, db1089 has been upgraded to the latest firmware which is now 5.04 for the `P840` model. The extracted RPM is at: `db1089:/home/mar...
[07:06:13] elukey: \o/
[07:08:06] elukey: I have brought scb2005 up, but I would like to give it one more restart to see what happens with its ifaces, can I reboot it just like that?
[07:08:59] (CR) Filippo Giunchedi: [C: 1] Use ffmpeg from jessie-backports on jessie-based video scalers [puppet] - https://gerrit.wikimedia.org/r/358381 (https://phabricator.wikimedia.org/T145742) (owner: Muehlenhoff)
[07:10:19] marostegui: yeah there shouldn't be any issues
[07:11:01] ok, thanks!
[07:12:30] !log Reboot scb2005 - T167638
[07:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:38] T167638: scb2005 eth0 interface gets renamed to eth2 - https://phabricator.wikimedia.org/T167638
[07:12:49] (PS2) KartikMistry: apertium-fra-cat: New upstream release [debs/contenttranslation/apertium-fra-cat] - https://gerrit.wikimedia.org/r/358517 (https://phabricator.wikimedia.org/T167247)
[07:13:00] RECOVERY - Check systemd state on scb2005 is OK: OK - unknown: The operational state could not be determined, due to lack of resources or another error cause.
[07:13:20] (CR) jerkins-bot: [V: -1] apertium-fra-cat: New upstream release [debs/contenttranslation/apertium-fra-cat] - https://gerrit.wikimedia.org/r/358517 (https://phabricator.wikimedia.org/T167247) (owner: KartikMistry)
[07:14:06] (PS1) Marostegui: db-eqiad.php: Repool db1089 with less weight [mediawiki-config] - https://gerrit.wikimedia.org/r/358521
[07:15:01] PROBLEM - SSH on scb2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:15:01] PROBLEM - pdfrender on scb2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:15:10] PROBLEM - eventstreams on scb2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:15:10] PROBLEM - ores uWSGI web app on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:15:10] PROBLEM - nutcracker process on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:15:10] PROBLEM - apertium apy on scb2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:15:10] PROBLEM - mathoid endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:15:10] PROBLEM - changeprop endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:15:10] PROBLEM - trendingedits endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:15:11] PROBLEM - ores on scb2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:15:11] PROBLEM - graphoid endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:15:30] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:15:30] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:15:30] PROBLEM - cxserver endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:15:34] That was me, I thought it was downtimed from yesterday
[07:15:40] PROBLEM - Check size of conntrack table on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:16:01] PROBLEM - Disk space on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:16:10] PROBLEM - MD RAID on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:16:10] PROBLEM - configured eth on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:16:10] PROBLEM - dhclient process on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:16:10] PROBLEM - DPKG on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:16:10] PROBLEM - puppet last run on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:16:10] PROBLEM - salt-minion processes on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:16:10] PROBLEM - Check systemd state on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:16:20] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy
[07:16:30] RECOVERY - Check size of conntrack table on scb2005 is OK: OK: nf_conntrack is 0 % full
[07:16:50] RECOVERY - pdfrender on scb2005 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time
[07:16:50] RECOVERY - SSH on scb2005 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[07:16:50] RECOVERY - Disk space on scb2005 is OK: DISK OK
[07:17:00] RECOVERY - apertium apy on scb2005 is OK: HTTP OK: HTTP/1.1 200 OK - 5632 bytes in 0.002 second response time
[07:17:00] RECOVERY - ores on scb2005 is OK: HTTP OK: HTTP/1.0 200 OK - 3666 bytes in 0.002 second response time
[07:17:00] RECOVERY - ores uWSGI web app on scb2005 is OK: ● uwsgi-ores.service - uwsgi-ores uwsgi app
[07:17:00] RECOVERY - eventstreams on scb2005 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.023 second response time
[07:17:00] RECOVERY - Check whether ferm is active by checking the default input chain on scb2005 is OK: OK ferm input default policy is set
[07:17:00] RECOVERY - nutcracker process on scb2005 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker
[07:17:00] RECOVERY - configured eth on scb2005 is OK: OK - interfaces up
[07:17:01] RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational
[07:17:01] RECOVERY - dhclient process on scb2005 is OK: PROCS OK: 0 processes with command name dhclient
[07:17:02] RECOVERY - DPKG on scb2005 is OK: All packages OK
[07:17:02] RECOVERY - salt-minion processes on scb2005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:17:03] RECOVERY - puppet last run on scb2005 is OK: OK: Puppet is currently enabled, last run 31 minutes ago with 0 failures
[07:17:20] RECOVERY - cxserver endpoints health on scb2005 is OK: All endpoints are healthy
[07:17:20] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[07:18:58] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1089 with less weight [mediawiki-config] - https://gerrit.wikimedia.org/r/358521 (owner: Marostegui)
[07:19:22] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:19:56] (Merged) jenkins-bot: db-eqiad.php: Repool db1089 with less weight [mediawiki-config] - https://gerrit.wikimedia.org/r/358521 (owner: Marostegui)
[07:20:08] (CR) jenkins-bot: db-eqiad.php: Repool db1089 with less weight [mediawiki-config] - https://gerrit.wikimedia.org/r/358521 (owner: Marostegui)
[07:21:01] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1089 with less weight (duration: 00m 41s)
[07:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:47] looks the the 5xx were api
[07:26:22] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:26:35] Operations, ops-codfw, Services (watching): scb2005 eth0 interface gets renamed to eth2 - https://phabricator.wikimedia.org/T167638#3343603 (Marostegui) Open>Resolved a:Marostegui I did a second reboot and the mac addresses cache remained untouched and the server is back up normally. Howe...
[07:28:22] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:30:37] !log restarting HHVM on mw canaries to pick up libtasn update
[07:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:33] (PS1) Marostegui: db-eqiad.php: Increase db1089 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/358522
[07:38:48] (CR) Marostegui: [C: 2] db-eqiad.php: Increase db1089 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/358522 (owner: Marostegui)
[07:39:33] Operations, Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3343623 (akosiaris) >>! In T159756#3343000, @RobH wrote: > My understanding is netmon1001 will have a new task made for decommission once netmon1002 replaces it. netmon1001 is out of warranty. That'...
[07:40:11] (Merged) jenkins-bot: db-eqiad.php: Increase db1089 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/358522 (owner: Marostegui)
[07:40:20] (CR) jenkins-bot: db-eqiad.php: Increase db1089 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/358522 (owner: Marostegui)
[07:41:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1089 weight (duration: 00m 41s)
[07:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:48] (CR) Alexandros Kosiaris: "Good idea!!! Thanks for this!" [puppet] - https://gerrit.wikimedia.org/r/358501 (https://phabricator.wikimedia.org/T166371) (owner: Dzahn)
[07:46:43] Operations, Operations-Software-Development, monitoring, Patch-For-Review: Monitoring: create an alert for daemonized puppet - https://phabricator.wikimedia.org/T166371#3343628 (akosiaris) The change above makes it impossible to have daemonized agents running as root so I 'd say this is resolved...
[07:47:59] !log Drop table updates on enwiki (s1) - T139342
[07:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:08] T139342: DROP OAI-related tables - https://phabricator.wikimedia.org/T139342
[07:50:32] Operations, Performance-Team, Thumbor, MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3343639 (Gilles) Open>declined
[08:01:39] !log adding elastic2020 back in the elasticsearch cluster - T149006
[08:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:48] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006
[08:02:03] (CR) Alexandros Kosiaris: "Just a bit of more info on this one. Usually when I see this alert, I take a quick look at exim stats in ganglia (AFAIK we don't have thes" [puppet] - https://gerrit.wikimedia.org/r/358504 (owner: Dzahn)
[08:04:03] (PS35) Elukey: role::zookeeper: refactor to multiple profiles [puppet] - https://gerrit.wikimedia.org/r/354449 (https://phabricator.wikimedia.org/T114815)
[08:05:23] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic2020.codfw.wmnet
[08:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:32] Operations, ops-codfw, DC-Ops, Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3343648 (Gehel) elastic2020 is back into rotation, stress tests show no issue. @debt: this can be closed...
[08:10:36] (PS1) Marostegui: db-eqiad.php: Increase db1089 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/358523
[08:11:30] (PS36) Elukey: role::zookeeper: refactor to multiple profiles [puppet] - https://gerrit.wikimedia.org/r/354449 (https://phabricator.wikimedia.org/T114815)
[08:12:29] (CR) Marostegui: [C: 2] db-eqiad.php: Increase db1089 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/358523 (owner: Marostegui)
[08:13:20] mutante: rancid key is for connection, not for deployment, so I think T154943 doesn't apply at all and it's good to have it with a different password. It is also used on different hosts from the deployment ones...
[08:13:20] T154943: make deployment SSH keys use the same passphrase - https://phabricator.wikimedia.org/T154943
[08:13:47] (Merged) jenkins-bot: db-eqiad.php: Increase db1089 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/358523 (owner: Marostegui)
[08:14:42] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1089 weight (duration: 00m 42s)
[08:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:51] (CR) jenkins-bot: db-eqiad.php: Increase db1089 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/358523 (owner: Marostegui)
[08:16:06] Operations, ops-codfw: Troubleshoot scb2005 NICs - https://phabricator.wikimedia.org/T167763#3343669 (Marostegui)
[08:16:08] Operations, ops-codfw: Troubleshoot scb2005 NICs - https://phabricator.wikimedia.org/T167763#3343685 (Marostegui) p:Triage>Normal
[08:20:28] (PS37) Elukey: role::zookeeper: refactor to multiple profiles [puppet] - https://gerrit.wikimedia.org/r/354449 (https://phabricator.wikimedia.org/T114815)
[08:21:31] !log restart OSM synchronisation on maps2001
[08:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:45] Operations, HHVM, Release-Engineering-Team (Kanban), Upstream: HHVM 3.18 crashes in realloc() as exposed by luasandbox - https://phabricator.wikimedia.org/T165043#3343697 (MoritzMuehlenhoff)
[08:27:01] Operations, ops-codfw, DBA: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T167667#3343712 (Marostegui) a:Papaul
[08:27:24] Operations, HHVM, Patch-For-Review, Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3343716 (MoritzMuehlenhoff)
[08:27:28] Operations, HHVM, Release-Engineering-Team (Kanban), Upstream: HHVM 3.18 crashes in realloc() as exposed by luasandbox - https://phabricator.wikimedia.org/T165043#3255165 (MoritzMuehlenhoff) Open>Resolved I've built new HHVM packages (3.18.2+wmf5) which include the upstream fix from https...
[08:28:13] Operations, DBA, Traffic: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3343718 (Marostegui) stalled>Open p:Triage>Normal Let's close this for now
[08:29:27] Operations, DBA, Traffic: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3343722 (Marostegui) Open>stalled
[08:29:59] Operations, ops-eqiad, User-Elukey, User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3343723 (Marostegui) p:Triage>Normal
[08:30:56] (PS38) Elukey: role::zookeeper: refactor to multiple profiles [puppet] - https://gerrit.wikimedia.org/r/354449 (https://phabricator.wikimedia.org/T114815)
[08:32:03] (PS39) Elukey: role::zookeeper: refactor to multiple profiles [puppet] - https://gerrit.wikimedia.org/r/354449 (https://phabricator.wikimedia.org/T114815)
[08:32:59] Operations, DC-Ops: Wipe of spare/replacement disks - https://phabricator.wikimedia.org/T166368#3343725 (Marostegui) p:Triage>Normal
[08:33:43] sorry for the spam people, trying to make this change a total no-op
[08:33:48] but still need to test with pcc
[08:33:52] (03PS1) 10Marostegui: db-eqiad.php: Restore db1089 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358526 [08:34:58] (03PS40) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 (https://phabricator.wikimedia.org/T114815) [08:35:28] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358527 [08:35:49] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1089 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358526 (owner: 10Marostegui) [08:37:21] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1089 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358526 (owner: 10Marostegui) [08:37:30] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1089 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358526 (owner: 10Marostegui) [08:38:27] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358527 [08:38:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1089 original weight - T166935 (duration: 00m 42s) [08:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:01] T166935: db1089: update RAID controller firwmare - https://phabricator.wikimedia.org/T166935 [08:40:02] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358527 (owner: 10Marostegui) [08:41:27] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358527 (owner: 10Marostegui) [08:41:41] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358527 (owner: 10Marostegui) [08:42:29] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1060 - T166205 (duration: 00m 41s) [08:42:38] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:38] T166205: Convert unique keys into primary keys for some wiki tables on s2 - https://phabricator.wikimedia.org/T166205 [08:43:07] (CR) Faidon Liambotis: [C: 1] "I'm usually skeptical about that practice for things that can break easily between version to version like ffmpeg, as jessie-backports may" [puppet] - https://gerrit.wikimedia.org/r/358381 (https://phabricator.wikimedia.org/T145742) (owner: Muehlenhoff) [08:45:11] (CR) Faidon Liambotis: [C: 1] Adjust wikimedia.org SPF from neutral (?all) to soft fail (~all) to impede sender address spoofing. [dns] - https://gerrit.wikimedia.org/r/358132 (https://phabricator.wikimedia.org/T133191) (owner: Herron) [08:47:48] (CR) Faidon Liambotis: [C: -1] Adjust wikimedia.org SPF from neutral (?all) to soft fail (~all) to impede sender address spoofing. (1 comment) [dns] - https://gerrit.wikimedia.org/r/358132 (https://phabricator.wikimedia.org/T133191) (owner: Herron) [08:48:21] (PS1) Marostegui: db-eqiad.php: Depool db1036 [mediawiki-config] - https://gerrit.wikimedia.org/r/358529 (https://phabricator.wikimedia.org/T166205) [08:50:44] (CR) Faidon Liambotis: [C: -1] Adding logrotate template to set mail::mx exim log retention to 60 days (1 comment) [puppet] - https://gerrit.wikimedia.org/r/357723 (owner: Herron) [08:55:05] (CR) Muehlenhoff: "ffmpeg from jessie-backports will follow the ffmpeg version in stretch, so remain API-stable from now on.
But independant of that, I'm pla" [puppet] - https://gerrit.wikimedia.org/r/358381 (https://phabricator.wikimedia.org/T145742) (owner: Muehlenhoff) [08:56:43] !log upgrading mw1165-mw1167 to HHVM 3.18 [08:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:46] (CR) Hashar: "Regarding the CA configuration, it looks like Kibana 5.1.x has:" [puppet] - https://gerrit.wikimedia.org/r/356900 (owner: Hashar) [09:09:05] (CR) Matthias Mullie: "We hope to merge/deploy this next Q - there will probably be some changes in the 3d2png repo, but I doubt these will affect this patch, so" [puppet] - https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: MarkTraceur) [09:13:07] !log Deploy alter table s4 - db1095 - T166206 [09:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:18] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206 [09:19:24] (CR) Faidon Liambotis: [C: -1] "That's... a little too much in many regards: to review (for e.g.
licenses), to carry in the puppet repository, and to execute in the first" [puppet] - https://gerrit.wikimedia.org/r/353937 (owner: BryanDavis) [09:22:58] (PS41) Elukey: role::zookeeper: refactor to multiple profiles [puppet] - https://gerrit.wikimedia.org/r/354449 (https://phabricator.wikimedia.org/T114815) [09:23:29] (PS42) Elukey: role::zookeeper: refactor to multiple profiles [puppet] - https://gerrit.wikimedia.org/r/354449 (https://phabricator.wikimedia.org/T114815) [09:23:31] (CR) Alexandros Kosiaris: [C: 1] role::zookeeper: refactor to multiple profiles [puppet] - https://gerrit.wikimedia.org/r/354449 (https://phabricator.wikimedia.org/T114815) (owner: Elukey) [09:24:21] (CR) Alexandros Kosiaris: [C: 1] role::zookeeper: refactor to multiple profiles [puppet] - https://gerrit.wikimedia.org/r/354449 (https://phabricator.wikimedia.org/T114815) (owner: Elukey) [09:24:36] (CR) Zfilipin: "The problem has been reported as https://phabricator.wikimedia.org/T167773" [mediawiki-config] - https://gerrit.wikimedia.org/r/322247 (https://phabricator.wikimedia.org/T150618) (owner: Urbanecm) [09:24:54] (CR) Elukey: [C: 2] role::zookeeper: refactor to multiple profiles [puppet] - https://gerrit.wikimedia.org/r/354449 (https://phabricator.wikimedia.org/T114815) (owner: Elukey) [09:26:29] Puppet has been disabled on kafka*, analytics*, druid* [09:26:33] and conf* [09:26:37] log it ;) [09:26:39] will proceed with the rollout [09:27:06] !log puppet disabled on kafka*, analytics*, druid*, conf* for https://gerrit.wikimedia.org/r/354449 - incremental rollout [09:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:03] (PS3) Muehlenhoff: Tighten access to zookeeper [puppet] - https://gerrit.wikimedia.org/r/356548 (https://phabricator.wikimedia.org/T114815) [09:28:27] !log upgrading mw1276-mw1282 to HHVM 3.18 [09:28:34] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:26] (PS4) Filippo Giunchedi: Disable Thumbor dual-serving and serve testwiki with Thumbor [puppet] - https://gerrit.wikimedia.org/r/357968 (https://phabricator.wikimedia.org/T167490) (owner: Gilles) [09:35:21] (CR) Filippo Giunchedi: [C: 2] Disable Thumbor dual-serving and serve testwiki with Thumbor [puppet] - https://gerrit.wikimedia.org/r/357968 (https://phabricator.wikimedia.org/T167490) (owner: Gilles) [09:35:44] gilles: ^ [09:37:52] !log disable thumbor shadow requests, enable thumbor-only serving for testwiki - T167490 [09:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:02] T167490: Disable Thumbor dual-serving - https://phabricator.wikimedia.org/T167490 [09:40:34] PROBLEM - very high load average likely xfs on ms-be1019 is CRITICAL: CRITICAL - load average: 126.64, 104.32, 79.87 [09:49:35] RECOVERY - very high load average likely xfs on ms-be1019 is OK: OK - load average: 41.05, 71.21, 77.37 [09:54:27] !log upgrading mw2248-mw2250 to HHVM 3.18 [09:54:30] Operations, ops-eqiad, User-Joe: Decom mw1170-mw1179, and replace them with new systems. - https://phabricator.wikimedia.org/T167130#3344091 (Marostegui) p:Triage>Normal [09:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:42] Operations: appserver fatals - intermittent failed connections to rdb2005 - https://phabricator.wikimedia.org/T163405#3344102 (Marostegui) Open>Resolved p:Triage>Normal Going to close this as looking at the last 60 days of fatalmonitor doesn't show this host there anymore on the top hosts. Fe...
[10:00:52] Operations, ops-esams, Traffic: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965#3344106 (Marostegui) p:Triage>Normal [10:05:08] Operations, ops-codfw, hardware-requests, User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3344111 (Marostegui) a:Papaul [10:05:14] Operations, ops-codfw, hardware-requests, User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3174562 (Marostegui) p:Triage>Normal [10:11:11] !log completed rollout of https://gerrit.wikimedia.org/r/354449 [10:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:53] Operations, DBA, Wikimedia-Site-requests: Global rename of Idh0854 -> Garam: supervision needed - https://phabricator.wikimedia.org/T167031#3344164 (MarcoAurelio) @greg Can we get a one hour window on wikitech:Deployments in coordination with @jcrespo and @Marostegui to do this? Thanks!
[10:17:38] (PS1) Gilles: Re-enable Thumbor thumbnail Swift storage [puppet] - https://gerrit.wikimedia.org/r/358548 (https://phabricator.wikimedia.org/T167783) [10:19:26] (CR) Filippo Giunchedi: [C: 2] Re-enable Thumbor thumbnail Swift storage [puppet] - https://gerrit.wikimedia.org/r/358548 (https://phabricator.wikimedia.org/T167783) (owner: Gilles) [10:21:09] gilles: ^ {{done}} restarting thumbor [10:21:16] Operations, DBA, Wikimedia-Site-requests: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031#3344183 (MarcoAurelio) [10:21:37] !log reenable thumbor swift storage, same paths as mediawiki - T167783 [10:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:46] T167783: Re-enable saving Thumbor thumbnails to Swift - https://phabricator.wikimedia.org/T167783 [10:23:12] (CR) Jonas Kress (WMDE): [C: 1] Add “Constraints” section for constraint statements [mediawiki-config] - https://gerrit.wikimedia.org/r/358343 (https://phabricator.wikimedia.org/T167126) (owner: Lucas Werkmeister (WMDE)) [10:28:00] (PS1) Gilles: Deploy Thumbor to group0 wikis [puppet] - https://gerrit.wikimedia.org/r/358551 (https://phabricator.wikimedia.org/T167782) [10:33:25] (PS1) Lucas Werkmeister (WMDE): Configure WikibaseQualityConstraints extension [mediawiki-config] - https://gerrit.wikimedia.org/r/358553 [10:59:11] !log upgrading mw1283-mw1290 to HHVM 3.18 [10:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:48] (CR) Jonas Kress (WMDE): [C: 1] Configure WikibaseQualityConstraints extension [mediawiki-config] - https://gerrit.wikimedia.org/r/358553 (owner: Lucas Werkmeister (WMDE)) [11:18:10] (PS1) KartikMistry: apertium-spa-ita: New upstream release [debs/contenttranslation/apertium-spa-ita] - https://gerrit.wikimedia.org/r/358570 (https://phabricator.wikimedia.org/T167247) [11:20:24] Operations, Upstream:
ferm broken in stretch - https://phabricator.wikimedia.org/T166653#3344303 (MoritzMuehlenhoff) Open>Resolved a:MoritzMuehlenhoff This is fixed in the stretch-wikimedia package. [11:21:43] (CR) Thiemo Mättig (WMDE): [C: 1] "We need a fresh deployment of the WikimediaMessages extension first before this can be merged, otherwise users will see an ugly " (CR) Alexandros Kosiaris: [C: 2] Introduce kubestagetcd100{1,2,3} and neon.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/358344 (https://phabricator.wikimedia.org/T162045) (owner: Alexandros Kosiaris) [11:31:47] (PS3) Alexandros Kosiaris: Introduce kubestagetcd100{1,2,3} and neon.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/358344 (https://phabricator.wikimedia.org/T162045) [11:32:13] (CR) Alexandros Kosiaris: [V: 2 C: 2] Introduce kubestagetcd100{1,2,3} and neon.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/358344 (https://phabricator.wikimedia.org/T162045) (owner: Alexandros Kosiaris) [11:41:34] !log upgrading HHVM on tin/naos to HHVM 3.18 [11:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:43] \o/ [11:46:19] (PS2) Muehlenhoff: Use ffmpeg from jessie-backports on jessie-based video scalers [puppet] - https://gerrit.wikimedia.org/r/358381 (https://phabricator.wikimedia.org/T145742) [11:49:26] (PS1) KartikMistry: apertium: Update package list [puppet] - https://gerrit.wikimedia.org/r/358575 (https://phabricator.wikimedia.org/T167247) [11:50:34] (CR) Muehlenhoff: [C: 2] Use ffmpeg from jessie-backports on jessie-based video scalers [puppet] - https://gerrit.wikimedia.org/r/358381 (https://phabricator.wikimedia.org/T145742) (owner: Muehlenhoff) [11:50:45] (PS2) Filippo Giunchedi: Deploy Thumbor to group0 wikis [puppet] - https://gerrit.wikimedia.org/r/358551 (https://phabricator.wikimedia.org/T167782) (owner: Gilles) [11:52:37] (CR) Filippo Giunchedi: [C: 2] Deploy Thumbor to
group0 wikis [puppet] - https://gerrit.wikimedia.org/r/358551 (https://phabricator.wikimedia.org/T167782) (owner: Gilles) [11:56:11] !log enable thumbor serving for group0 wikis with media files - T167782 [11:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:20] T167782: Deploy Thumbor to group0 wikis - https://phabricator.wikimedia.org/T167782 [11:56:23] gilles: ^ [11:56:57] godog: let me know when the swift proxies have been restarted [11:57:35] gilles: yup restarted already [11:58:04] (CR) Thiemo Mättig (WMDE): [C: 1] Configure WikibaseQualityConstraints extension (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/358553 (owner: Lucas Werkmeister (WMDE)) [11:58:31] godog: doesn't look like it's applied to mediawiki.org, maybe I got the container name wrong [11:58:36] I remember there were some exceptions there [11:59:21] (CR) Lucas Werkmeister (WMDE): "I’m also super happy that the propertySet type exists and that it’s not tied to the dataType :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/358343 (https://phabricator.wikimedia.org/T167126) (owner: Lucas Werkmeister (WMDE)) [11:59:22] could be wikimedia-mediawiki or something [12:00:18] could be, checking [12:04:28] wikipedia-mediawiki-local-public is the e.g. the public container, should be correct [12:05:59] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - https://gerrit.wikimedia.org/r/358577 [12:06:04] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - https://gerrit.wikimedia.org/r/358577 [12:07:43] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [12:08:20] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - https://gerrit.wikimedia.org/r/358577 (owner: Marostegui) [12:09:44] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - https://gerrit.wikimedia.org/r/358577 (owner: Marostegui) [12:09:54] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - https://gerrit.wikimedia.org/r/358577 (owner: Marostegui) [12:11:13] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1064 - T166206 (duration: 00m 51s) [12:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:23] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206 [12:11:32] gilles: ah yeah lang-proj in mediawiki case is www-mediawiki [12:11:59] ok I'll fix that after the meeting I'm in [12:12:35] kk, I'll finish reimaging the last swift codfw trusty host \o/ [12:12:52] !log Deploy alter table on s2 on dbstore1002 - T166205 [12:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:00] T166205: Convert unique keys into primary keys for some wiki tables on s2 - https://phabricator.wikimedia.org/T166205 [12:22:21] (CR) Jcrespo: "I'd suggest to put db1076 to load 1, to avoid problems in case of overload."
[mediawiki-config] - https://gerrit.wikimedia.org/r/358529 (https://phabricator.wikimedia.org/T166205) (owner: Marostegui) [12:23:06] (PS2) Marostegui: db-eqiad.php: Depool db1036 [mediawiki-config] - https://gerrit.wikimedia.org/r/358529 (https://phabricator.wikimedia.org/T166205) [12:31:30] (PS1) Aude: Enable Wikidata echo notifications for all wikis (except enwiki, frwiki, dewiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/358580 (https://phabricator.wikimedia.org/T142102) [12:33:34] (PS3) Marostegui: db-eqiad.php: Depool db1036 [mediawiki-config] - https://gerrit.wikimedia.org/r/358529 (https://phabricator.wikimedia.org/T166205) [12:33:39] jouncebot: refres [12:33:40] jouncebot: refresh [12:33:42] I refreshed my knowledge about deployments. [12:33:56] jouncebot: next [12:33:56] In 0 hour(s) and 26 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170613T1300) [12:34:36] (CR) KartikMistry: "recheck" [debs/contenttranslation/apertium-spa-ita] - https://gerrit.wikimedia.org/r/358570 (https://phabricator.wikimedia.org/T167247) (owner: KartikMistry) [12:35:09] (CR) jerkins-bot: [V: -1] apertium-spa-ita: New upstream release [debs/contenttranslation/apertium-spa-ita] - https://gerrit.wikimedia.org/r/358570 (https://phabricator.wikimedia.org/T167247) (owner: KartikMistry) [12:36:43] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [12:36:59] (PS2) KartikMistry: apertium-spa-ita: New upstream release [debs/contenttranslation/apertium-spa-ita] - https://gerrit.wikimedia.org/r/358570 (https://phabricator.wikimedia.org/T167247) [12:37:18] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1036 [mediawiki-config] - https://gerrit.wikimedia.org/r/358529 (https://phabricator.wikimedia.org/T166205) (owner: Marostegui) [12:37:24] (CR) jerkins-bot: [V: -1] apertium-spa-ita: New
upstream release [debs/contenttranslation/apertium-spa-ita] - https://gerrit.wikimedia.org/r/358570 (https://phabricator.wikimedia.org/T167247) (owner: KartikMistry) [12:38:16] (Merged) jenkins-bot: db-eqiad.php: Depool db1036 [mediawiki-config] - https://gerrit.wikimedia.org/r/358529 (https://phabricator.wikimedia.org/T166205) (owner: Marostegui) [12:38:25] (CR) jenkins-bot: db-eqiad.php: Depool db1036 [mediawiki-config] - https://gerrit.wikimedia.org/r/358529 (https://phabricator.wikimedia.org/T166205) (owner: Marostegui) [12:39:00] Amir1: phuedx: I have CR+2 your patches for the swat so we can deploy them right when the window begins [12:39:11] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1036 - T166205 (duration: 00m 41s) [12:39:19] hey, what's up? [12:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:21] T166205: Convert unique keys into primary keys for some wiki tables on s2 - https://phabricator.wikimedia.org/T166205 [12:39:38] ohhhh [12:39:40] great [12:39:43] Thanks [12:39:45] hashar: i'd like to add https://gerrit.wikimedia.org/r/#/c/358580/ (config patch) [12:39:52] aude: please do :) [12:39:56] ok [12:43:03] (CR) Hashar: "There is some slight differences in the dblists which results in the removal of that feature for a few wikis:" [mediawiki-config] - https://gerrit.wikimedia.org/r/358580 (https://phabricator.wikimedia.org/T142102) (owner: Aude) [12:43:20] aude: that drop the feature from arbcom wikis, tenwiki and wg_enwiki [12:43:23] no clue what those are [12:43:46] looks like they are closed [12:44:08] (CR) Aude: "that's ok. (those wikis don't have wikibase client.
having the config there before didn't do anything on those wikis and they don't need t" [mediawiki-config] - https://gerrit.wikimedia.org/r/358580 (https://phabricator.wikimedia.org/T142102) (owner: Aude) [12:44:23] good :] [12:44:24] they don't have wikibase so ok [12:44:30] !log Deploy alter table on s2 on db1036 - T166205 [12:44:32] lets deploy it in 16 minutes :) [12:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:40] T166205: Convert unique keys into primary keys for some wiki tables on s2 - https://phabricator.wikimedia.org/T166205 [12:44:40] didn't matter that they had the config (in the last 2 weeks) [12:44:54] but thanks for checking :) [12:48:30] Operations, Deployment-Systems: Have fallback communication channel when freenode has problems - https://phabricator.wikimedia.org/T127904#3344510 (Marostegui) p:High>Normal [12:56:24] (PS1) Ema: VCL: rate limit API requests [puppet] - https://gerrit.wikimedia.org/r/358583 (https://phabricator.wikimedia.org/T163233) [12:56:59] hashar: awesome! thanks! [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170613T1300). Please do the needful. [13:00:04] Amir1, phuedx, and aude: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:13] o/ [13:00:18] o/ [13:02:23] lets do aude config patch [13:02:26] it is straight forward [13:02:36] (CR) Hashar: [C: 2] "Thanks! SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/358580 (https://phabricator.wikimedia.org/T142102) (owner: Aude) [13:03:14] ok [13:03:24] (CR) Dereckson: [C: 1] "ID are consistent with the ones reported on the task."
[puppet] - https://gerrit.wikimedia.org/r/358500 (https://phabricator.wikimedia.org/T165733) (owner: Dzahn) [13:03:37] (Merged) jenkins-bot: Enable Wikidata echo notifications for all wikis (except enwiki, frwiki, dewiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/358580 (https://phabricator.wikimedia.org/T142102) (owner: Aude) [13:04:28] Amir1: you are next [13:04:40] yes, thanks [13:04:43] Amir1: not that https://gerrit.wikimedia.org/r/#/c/358507/1/includes/htmlform/OOUIHTMLForm.php does not take in account $elements = array(); [13:04:55] !log hashar@tin Synchronized wmf-config/Wikibase-production.php: Enable Wikidata echo notifications for all wikis (except enwiki, frwiki, dewiki) - T142102 (duration: 00m 42s) [13:04:58] aude: done [13:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:03] T142102: [Story] Deploy Wikibase notifications to Wikimedia projects - https://phabricator.wikimedia.org/T142102 [13:05:14] thanks [13:05:29] hashar: I will look into it soon [13:05:51] (CR) jenkins-bot: Enable Wikidata echo notifications for all wikis (except enwiki, frwiki, dewiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/358580 (https://phabricator.wikimedia.org/T142102) (owner: Aude) [13:05:56] phuedx: Amir1: your patches are on mwdebug1001 / mwdebug1002 [13:06:14] hmm not phuedx one sorry [13:06:25] hashar: works for me [13:06:40] Amir1: syncing [13:06:52] phuedx: your patch is on mwdebug1001 / mwdebug1002 (did just a scap pull) [13:07:03] hashar: ta [13:08:07] !log hashar@tin Synchronized php-1.30.0-wmf.4/includes/htmlform/OOUIHTMLForm.php: Do not try to parse empty argument in getErrorsOrWarnings in OOUI - T167644 (duration: 00m 41s) [13:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:16] T167644: Client special pages don't show expected interface messages - https://phabricator.wikimedia.org/T167644 [13:08:19] Amir1: done :] [13:08:42] works just fine,
thanks [13:08:55] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - free space: / 1712 MB (3% inode=40%): /srv/docker-dm 1712 MB (3% inode=40%) [13:09:41] ^ that's me, freeing some old builds [13:09:45] ah ok :) [13:09:46] thanks [13:10:49] (PS1) Gilles: Fix mediawiki.org container name for Thumbor config [puppet] - https://gerrit.wikimedia.org/r/358591 (https://phabricator.wikimedia.org/T167782) [13:11:39] hashar: o/ wfm [13:11:40] thanks [13:11:55] RECOVERY - Disk space on copper is OK: DISK OK [13:13:05] phuedx: syncing :) [13:13:40] !log hashar@tin Synchronized php-1.30.0-wmf.4/extensions/Popups: actions/rest: Use DB-key version of title - T167633 (duration: 00m 41s) [13:13:48] (CR) Filippo Giunchedi: [C: 2] Fix mediawiki.org container name for Thumbor config [puppet] - https://gerrit.wikimedia.org/r/358591 (https://phabricator.wikimedia.org/T167782) (owner: Gilles) [13:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:50] T167633: Regression: Page Previews not using normalised title in requests - https://phabricator.wikimedia.org/T167633 [13:15:13] Operations, Recommendation-API, Service-deployment-requests, Services (doing), User-mobrovac: New Service Request: recommendation-api - https://phabricator.wikimedia.org/T167664#3344553 (Marostegui) p:Triage>Normal [13:15:30] (CR) Ema: "This seems to be puppetfailing at the moment: https://puppet-compiler.wmflabs.org/6753/" [puppet] - https://gerrit.wikimedia.org/r/357844 (owner: BBlack) [13:15:44] !log European SWAT completed [13:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:07] (CR) Ema: [C: 1] numa_networking: add facter data from sysfs [puppet] - https://gerrit.wikimedia.org/r/355809 (owner: BBlack) [13:25:43] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2107965 [13:26:23] PROBLEM - MegaRAID on analytics1067 is CRITICAL:
CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [13:27:08] elukey: ^ [13:27:31] non repeating message :-P [13:27:49] (PS1) Gilles: Deploy Thumbor to group1 wikis + mediawiki.org [puppet] - https://gerrit.wikimedia.org/r/358598 (https://phabricator.wikimedia.org/T167782) [13:27:53] (PS9) BBlack: numa_networking: add facter data from sysfs [puppet] - https://gerrit.wikimedia.org/r/355809 [13:27:55] (PS10) BBlack: numa_networking: support NUMA in interface::rps [puppet] - https://gerrit.wikimedia.org/r/355810 [13:27:57] (PS10) BBlack: numa_networking: support NUMA in tlsproxy nginx config [puppet] - https://gerrit.wikimedia.org/r/355811 [13:27:59] (PS3) BBlack: numa_networking: test enable on cp4021 [puppet] - https://gerrit.wikimedia.org/r/357844 [13:28:01] (PS3) BBlack: numa_networking: remove install-time bnx2x stuff [puppet] - https://gerrit.wikimedia.org/r/357850 [13:29:03] the BBU is dead [13:29:11] is analytics1067 using WT by any chance? :P [13:29:14] (CR) Filippo Giunchedi: [C: 2] Deploy Thumbor to group1 wikis + mediawiki.org [puppet] - https://gerrit.wikimedia.org/r/358598 (https://phabricator.wikimedia.org/T167782) (owner: Gilles) [13:29:26] nice though!
[13:29:34] it is one of the newer boxes [13:29:38] really happy about it [13:29:53] and no automated task on purpose, it was asked to skip those ones [13:30:00] ones=specific error [13:30:02] (CR) BBlack: [C: 2] numa_networking: add facter data from sysfs [puppet] - https://gerrit.wikimedia.org/r/355809 (owner: BBlack) [13:30:08] nooooo I was about to ask you to force you to work [13:30:14] (PS10) BBlack: numa_networking: add facter data from sysfs [puppet] - https://gerrit.wikimedia.org/r/355809 [13:30:17] (CR) BBlack: [V: 2 C: 2] numa_networking: add facter data from sysfs [puppet] - https://gerrit.wikimedia.org/r/355809 (owner: BBlack) [13:30:19] :P [13:30:32] like: Riccardo wouldn't be great if... [13:30:33] hahaha [13:30:43] !log Thumbor to group1 wikis + mediawiki.org - T167793 [13:30:43] Operations, ops-eqiad, Analytics: analytics1067: Broken BBU - https://phabricator.wikimedia.org/T167797#3344626 (Marostegui) [13:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:52] T167793: Deploy Thumbor to group1 wikis - https://phabricator.wikimedia.org/T167793 [13:30:53] lol, manipulator [13:31:00] Operations, ops-eqiad, Analytics: analytics1067: Broken BBU - https://phabricator.wikimedia.org/T167797#3344640 (Marostegui) p:Triage>Normal [13:31:16] * elukey sends wikilove to marostegui [13:31:34] \o/ [13:31:37] note that the check [13:31:45] is open for patches for improvement [13:31:57] like X(repeated X times) [13:32:08] I did the minimum to support one LD [13:32:15] because that is all we needed [13:32:33] but I recognize it can be annoying with multiple LDs [13:34:04] (PS4) Andrew Bogott: wmfsink: Clean up proxy records for deleted instances.
[puppet] - https://gerrit.wikimedia.org/r/358399 (https://phabricator.wikimedia.org/T163765) [13:34:13] Operations, ops-eqiad, Analytics: analytics1067: Broken BBU - https://phabricator.wikimedia.org/T167797#3344646 (elukey) @Cmjohnson this host is one of the last batch (so under warranty for sure), can you order a new BBU whenever you have time? [13:34:57] it might be something else though, it's unlikely that is a dead battery [13:35:35] maybe was not well mounted/fixed in place and just moved? :D [13:35:36] Operations, ops-eqiad, Analytics: analytics1067: Broken BBU - https://phabricator.wikimedia.org/T167797#3344650 (Marostegui) I have not forced the RAID to go to WB, I would leave that to #analytics. If needed, this should be it: ``` megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll ``` And to re... [13:36:44] (CR) Andrew Bogott: [C: 2] wmfsink: Clean up proxy records for deleted instances. [puppet] - https://gerrit.wikimedia.org/r/358399 (https://phabricator.wikimedia.org/T163765) (owner: Andrew Bogott) [13:43:17] (PS3) Rush: nodepool: lower min-ready for trusty [puppet] - https://gerrit.wikimedia.org/r/356466 (owner: Hashar) [13:45:12] (CR) Jcrespo: [C: -1] "For starters, I may be wrong, but I think this assumes the policy is exclusive between those 6 keywords, when it actually is a combination" [puppet] - https://gerrit.wikimedia.org/r/357994 (https://phabricator.wikimedia.org/T166108) (owner: Faidon Liambotis) [13:46:21] (PS5) Gehel: Add Shiny Server module and Discovery Dashboards role/profile [puppet] - https://gerrit.wikimedia.org/r/353571 (https://phabricator.wikimedia.org/T161354) (owner: Bearloga) [13:46:28] (CR) Rush: [C: 2] nodepool: lower min-ready for trusty [puppet] - https://gerrit.wikimedia.org/r/356466 (owner: Hashar) [13:52:07] (PS1) Filippo Giunchedi: hieradata: use thumbor.svc for codfw too [puppet] - https://gerrit.wikimedia.org/r/358599
(https://phabricator.wikimedia.org/T121388) [13:56:06] (CR) Gilles: [C: 1] hieradata: use thumbor.svc for codfw too [puppet] - https://gerrit.wikimedia.org/r/358599 (https://phabricator.wikimedia.org/T121388) (owner: Filippo Giunchedi) [13:56:33] Operations, Performance-Team, Thumbor, User-fgiunchedi: Deploy thumbor in codfw - https://phabricator.wikimedia.org/T167801#3344747 (fgiunchedi) [13:57:23] (CR) Filippo Giunchedi: [C: 2] hieradata: use thumbor.svc for codfw too [puppet] - https://gerrit.wikimedia.org/r/358599 (https://phabricator.wikimedia.org/T121388) (owner: Filippo Giunchedi) [13:57:43] Operations, ops-eqiad, User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3344763 (Cmjohnson) [13:58:11] Operations, ops-eqiad, User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3284834 (Cmjohnson) Racked these in A4/B4/D4. Updated racktables w/basic info and rack location. [14:00:02] (PS1) Andrew Bogott: wmf_sink: Bugfixes for proxy cleanup [puppet] - https://gerrit.wikimedia.org/r/358600 [14:01:36] (CR) Gergő Tisza: [C: 1] "OTOH this way if template hacks are replaced by Babel on one of the projects with non-Babel categories, it will start working seamlessly." [mediawiki-config] - https://gerrit.wikimedia.org/r/358007 (owner: Amire80) [14:02:04] (CR) Andrew Bogott: [C: 2] wmf_sink: Bugfixes for proxy cleanup [puppet] - https://gerrit.wikimedia.org/r/358600 (owner: Andrew Bogott) [14:03:05] Operations, ops-codfw, hardware-requests, User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3344802 (Papaul) @Marostegui There are other steps that need to be done before this task can be assigned to me.
[14:03:21] (PS1) Hashar: nodepool: lower rate of queries from 6 to 5 [puppet] - https://gerrit.wikimedia.org/r/358601 (https://phabricator.wikimedia.org/T167803) [14:03:50] (PS1) Alexandros Kosiaris: Specify correct hostname in DHCP config [puppet] - https://gerrit.wikimedia.org/r/358602 [14:04:02] (CR) Hashar: "This need validation by cloud teams (Andrew && Chase). Best done on a Monday." [puppet] - https://gerrit.wikimedia.org/r/358601 (https://phabricator.wikimedia.org/T167803) (owner: Hashar) [14:05:19] (CR) Alexandros Kosiaris: [C: 2] Specify correct hostname in DHCP config [puppet] - https://gerrit.wikimedia.org/r/358602 (owner: Alexandros Kosiaris) [14:05:37] (PS2) Alexandros Kosiaris: Specify correct hostname in DHCP config [puppet] - https://gerrit.wikimedia.org/r/358602 [14:05:39] (PS2) Amire80: Sort wmgBabelMainCategory alphabetically [mediawiki-config] - https://gerrit.wikimedia.org/r/358006 [14:05:42] (PS2) Amire80: Add wmgBabelMainCategory for many languages [mediawiki-config] - https://gerrit.wikimedia.org/r/358007 [14:06:23] RECOVERY - MegaRAID on analytics1067 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [14:07:26] elukey, marostegui ^^^ false contact of the battery maybe? [14:08:31] Operations, ops-codfw, hardware-requests, User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3344817 (Marostegui) Ah, ok - sorry @Papaul - I thought you'd take over from the non interrupptable section as stated here: https://wikitech.wikimedia.org/wiki... [14:08:33] could be [14:08:33] Operations, ops-codfw, hardware-requests, User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3344816 (Marostegui) Ah, ok - sorry @Papaul - I thought you'd take over from the non interrupptable section as stated here: https://wikitech.wikimedia.org/wiki...
[14:10:22] 10Operations, 10ops-eqiad, 10Analytics: analytics1067: Broken BBU - https://phabricator.wikimedia.org/T167797#3344826 (10Marostegui) 05Open>03Resolved Looks like it recovered itself: ``` ˜/icinga-wm 16:06> RECOVERY - MegaRAID on analytics1067 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy... [14:10:52] !log T164865: Restart RESTBase dev; apply range delete probability of 1.0 [14:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:02] (03PS2) 10Filippo Giunchedi: hieradata: swift temporary a/a [puppet] - 10https://gerrit.wikimedia.org/r/358376 (https://phabricator.wikimedia.org/T162609) [14:11:02] T164865: Prototype and test range delete-based current revision storage - https://phabricator.wikimedia.org/T164865 [14:11:20] (03CR) 10BBlack: [C: 031] hierata: swift active in codfw only [puppet] - 10https://gerrit.wikimedia.org/r/358377 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [14:11:25] !log upgrading mw1299-mw1306 to HHVM 3.18 [14:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:45] (03CR) 10BBlack: [C: 031] hieradata: swift temporary a/a [puppet] - 10https://gerrit.wikimedia.org/r/358376 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [14:12:28] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: swift temporary a/a [puppet] - 10https://gerrit.wikimedia.org/r/358376 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [14:13:06] (03CR) 10Ottomata: [C: 031] "I just verified some things, and this should be good to go! Will merge shortly..." 
[puppet] - 10https://gerrit.wikimedia.org/r/253474 (https://phabricator.wikimedia.org/T118557) (owner: 10BBlack) [14:14:03] !log point upload varnish to swift in codfw - T162609 [14:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:11] T162609: Swift version and distro upgrade - https://phabricator.wikimedia.org/T162609 [14:14:34] 10Operations, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#3344847 (10Ottomata) Ooook, I just checked some things. - x_forwarded_for was only being used by legacy pageview code in refinery, which itself... [14:14:51] (03PS2) 10Filippo Giunchedi: hierata: swift active in codfw only [puppet] - 10https://gerrit.wikimedia.org/r/358377 (https://phabricator.wikimedia.org/T162609) [14:15:49] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T167667#3344848 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your reque... [14:15:50] (03CR) 10Ottomata: [C: 031] "Actually, just to be super sure, I'm going to stop puppet on a misc varnish, and apply this change there manually. Then we'll let an hour " [puppet] - 10https://gerrit.wikimedia.org/r/253474 (https://phabricator.wikimedia.org/T118557) (owner: 10BBlack) [14:16:06] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T167667#3344851 (10Marostegui) Thank you!! 
[14:16:20] (03CR) 10Filippo Giunchedi: [C: 032] hierata: swift active in codfw only [puppet] - 10https://gerrit.wikimedia.org/r/358377 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [14:17:32] !log stopping puppet on cp1045, testing removal of xff from varnishkafka webrequest data [14:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:38] godog: FYI ^^^ I guess you're running puppet there too to make the change for swift applied [14:19:24] volans: thanks! no though, only upload in my case [14:19:32] but cumin is already done anyways [14:19:36] (03PS3) 10Ottomata: webrequest: remove X-Forwarded-For [puppet] - 10https://gerrit.wikimedia.org/r/253474 (https://phabricator.wikimedia.org/T118557) (owner: 10BBlack) [14:20:00] great :D sorry but my brain refuses to memorize the cp-group matching :D [14:20:45] heheh yeah I don't know it either, I checked the host list [14:22:28] !log restarting elasticsearch on relforge to validate GC configuration - T167636 [14:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:37] T167636: Investigate perf regression after elasticsearch 5.3.2 deployment - https://phabricator.wikimedia.org/T167636 [14:31:34] (03PS2) 10Ema: VCL: rate limit text-frontend requests [puppet] - 10https://gerrit.wikimedia.org/r/358583 (https://phabricator.wikimedia.org/T163233) [14:35:46] (03PS2) 10Herron: Add logrotate template to retain 60 days of exim mx logs [puppet] - 10https://gerrit.wikimedia.org/r/357723 [14:36:53] (03CR) 10Faidon Liambotis: "It's a define, not a class, so no, it's not exclusive. 
You can do" [puppet] - 10https://gerrit.wikimedia.org/r/357994 (https://phabricator.wikimedia.org/T166108) (owner: 10Faidon Liambotis) [14:37:48] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-fra: New upstream release [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/358366 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry) [14:38:15] (03PS3) 10Herron: Add logrotate template to retain 60 days of exim mx logs [puppet] - 10https://gerrit.wikimedia.org/r/357723 (https://phabricator.wikimedia.org/T167333) [14:38:36] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T167667#3344895 (10Papaul) After creating the case the HP guy is telling me that he can't place the order because there is a hold on the server and cannot tell me what the hold is, that another team will contac... [14:40:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] apertium-cat: New upstream release (031 comment) [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/358354 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry) [14:40:19] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-ita: New upstream release [debs/contenttranslation/apertium-ita] - 10https://gerrit.wikimedia.org/r/358515 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry) [14:40:31] (03PS1) 10Faidon Liambotis: smokeping: s/FRack/frack/ [puppet] - 10https://gerrit.wikimedia.org/r/358608 [14:40:38] (03CR) 10BBlack: [C: 031] "analytics data indicates that 300/60s should cover the full request load for the worst single client IP we can see in the past week (an MS" [puppet] - 10https://gerrit.wikimedia.org/r/358583 (https://phabricator.wikimedia.org/T163233) (owner: 10Ema) [14:40:40] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-spa: New upstream release [debs/contenttranslation/apertium-spa] - 10https://gerrit.wikimedia.org/r/358512 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry)
[14:41:06] (03CR) 10Jcrespo: [C: 04-1] "> It's a define, not a class, so no, it's not exclusive. You can do" [puppet] - 10https://gerrit.wikimedia.org/r/357994 (https://phabricator.wikimedia.org/T166108) (owner: 10Faidon Liambotis) [14:41:08] (03CR) 10Faidon Liambotis: [V: 032 C: 032] smokeping: s/FRack/frack/ [puppet] - 10https://gerrit.wikimedia.org/r/358608 (owner: 10Faidon Liambotis) [14:43:14] (03PS1) 10Filippo Giunchedi: hieradata: point varnish upload esams to codfw [puppet] - 10https://gerrit.wikimedia.org/r/358609 (https://phabricator.wikimedia.org/T162609) [14:44:53] (03PS4) 10Herron: Add logrotate template to retain 60 days of exim mx logs [puppet] - 10https://gerrit.wikimedia.org/r/357723 (https://phabricator.wikimedia.org/T167333) [14:45:07] Does anyone know who maintains ircecho code and/or where or if there's documentation for it? [14:45:43] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 127 [14:47:02] bblack ema I forgot inter-cache routing, only about 200 requests/s are hitting swift codfw https://gerrit.wikimedia.org/r/#/c/358609 [14:47:06] (03CR) 10Marostegui: "I am not completely sure we want to handle the RAID policies from puppet directly. It should be safe, but as Jaime said, we are normally q" [puppet] - 10https://gerrit.wikimedia.org/r/357994 (https://phabricator.wikimedia.org/T166108) (owner: 10Faidon Liambotis) [14:48:34] godog: if you want a bunch of load, another option is to shut off the eqiad->swift link and it will all go codfw [14:48:40] godog: but really either way works [14:49:42] bblack: yup that would work for me too, I went with the above since that's what we do for the switchover [14:50:09] (03CR) 10BBlack: [C: 031] hieradata: point varnish upload esams to codfw [puppet] - 10https://gerrit.wikimedia.org/r/358609 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [14:50:46] thanks!
[14:50:46] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: point varnish upload esams to codfw [puppet] - 10https://gerrit.wikimedia.org/r/358609 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [14:50:48] (03PS2) 10Filippo Giunchedi: hieradata: point varnish upload esams to codfw [puppet] - 10https://gerrit.wikimedia.org/r/358609 (https://phabricator.wikimedia.org/T162609) [14:51:41] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] hieradata: point varnish upload esams to codfw [puppet] - 10https://gerrit.wikimedia.org/r/358609 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [14:52:03] (03CR) 10Herron: "Thanks for the feedback. Commit message has been updated (after a few hiccups). Body seems to display better in Gerrit when lines are un" [puppet] - 10https://gerrit.wikimedia.org/r/357723 (https://phabricator.wikimedia.org/T167333) (owner: 10Herron) [14:53:05] !log update inter-routing for upload to point esams to codfw [14:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:53] (03PS3) 10Ema: VCL: rate limit text-frontend requests [puppet] - 10https://gerrit.wikimedia.org/r/358583 (https://phabricator.wikimedia.org/T163233) [15:03:53] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3344966 (10Marostegui) I just checked db1098 for instance and it still has the same issue indeed so I assume the other ones will remain with the same issue. 
[15:03:59] 10Operations, 10ops-eqiad: rack/setup/install ores100[1-9] - https://phabricator.wikimedia.org/T167808#3344967 (10RobH) [15:04:53] (03PS5) 10KartikMistry: Update apertium-cat package [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/358354 (https://phabricator.wikimedia.org/T167247) [15:05:01] (03CR) 10KartikMistry: Update apertium-cat package (031 comment) [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/358354 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry) [15:09:12] !log applying new GC configuration on elastic1018 - T167636 [15:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:21] T167636: Investigate perf regression after elasticsearch 5.3.2 deployment - https://phabricator.wikimedia.org/T167636 [15:09:25] (03PS6) 10BryanDavis: bd808's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/353937 [15:13:39] (03CR) 10Alexandros Kosiaris: [C: 032] Update apertium-cat package [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/358354 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry) [15:13:50] (03PS1) 10Filippo Giunchedi: hieradata: a/a for varnish swift_thumbs [puppet] - 10https://gerrit.wikimedia.org/r/358611 (https://phabricator.wikimedia.org/T162609) [15:13:52] (03PS1) 10Filippo Giunchedi: hieradata: a/p for varnish swift_thumbs [puppet] - 10https://gerrit.wikimedia.org/r/358612 (https://phabricator.wikimedia.org/T162609) [15:15:26] 10Operations, 10ops-codfw, 10hardware-requests, 10User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3345009 (10RobH) a:05Papaul>03RobH I can take this over from here to the onsite wipe stage. Thanks for doing all the steps up to the non-interrupt @Marostegui!
[15:15:37] last two patches to fully switch swift to codfw, so far looks good [15:16:24] 10Operations, 10ops-codfw, 10hardware-requests, 10User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3345016 (10Marostegui) Thanks Rob! [15:16:40] bd808: heh, thanks :) [15:16:41] godog: great! what about swiftrepl, is it stopped? [15:17:04] paravoid: it was a giant pile of things [15:17:24] yeah, always amazes me when people are so particular about their setup :) [15:17:29] ori is too [15:17:34] I don't even have dotfiles checked in :) [15:17:37] 10Operations, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#3345019 (10Nuria) a:03Ottomata [15:17:56] volans: it is yeah, but it isn't related to this change since we're not touching what mw thinks of primary/secondary location [15:18:08] so my only (minor) comment on the updated PS is that some of them are third-party and not accompanied by copyright/license [15:18:12] my first draft there was less than half of the things in my private dotfile repo ;) [15:18:21] most vim plugins do have a copyright/license header, but not all [15:19:10] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: a/a for varnish swift_thumbs [puppet] - 10https://gerrit.wikimedia.org/r/358611 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [15:19:20] hmmm.. ok. I'll see if I can track down those plugins. 
I don't think I pruned LICENSE files at any point but maybe I can tell from the real upstream [15:19:48] sorry, I realize this is a bit of grunt work :/ [15:20:24] (03CR) 10BBlack: [C: 031] VCL: rate limit text-frontend requests [puppet] - 10https://gerrit.wikimedia.org/r/358583 (https://phabricator.wikimedia.org/T163233) (owner: 10Ema) [15:20:59] godog: right [15:22:17] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: a/p for varnish swift_thumbs [puppet] - 10https://gerrit.wikimedia.org/r/358612 (https://phabricator.wikimedia.org/T162609) (owner: 10Filippo Giunchedi) [15:22:41] elukey: Now analytics1063 is complaining about the BBU [15:22:46] is that one new too? [15:23:42] Ah, that one is discharging [15:23:43] marostegui: yep last batch [15:23:46] bd808: python_startup.py's repository has a LICENSE file that's not copied here too [15:23:57] (03CR) 10BBlack: [C: 031] VCL: use resp.reason for synthetic responses generation [puppet] - 10https://gerrit.wikimedia.org/r/358057 (owner: 10Ema) [15:24:07] There is a learn cycle going on [15:24:48] I don't think most of these pass the threshold for being copyrightable, but for those that their authors have explicitly attached a copyright statement and license, I think we should be honoring it [15:25:22] elukey: Auto-Learn Mode: Transparent [15:25:29] paravoid: *nod* I'll look at it tonight [15:25:30] So you do have the auto-learn enabled [15:25:49] bd808: no worries, and sorry, I realize this is a bit silly :( [15:26:35] 10Operations, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#3345058 (10Ottomata) @bblack, just one last double check: are you sure XFF is not useful for ops purposes? We can easily exclude this data fro...
[15:26:44] (03PS1) 10RobH: ms-be2001 through ms-be2012 decom [puppet] - 10https://gerrit.wikimedia.org/r/358615 [15:27:15] (03CR) 10RobH: [C: 032] ms-be2001 through ms-be2012 decom [puppet] - 10https://gerrit.wikimedia.org/r/358615 (owner: 10RobH) [15:28:41] (03PS10) 10Mforns: Add white-list for EventLogging auto-purging [puppet] - 10https://gerrit.wikimedia.org/r/298721 (https://phabricator.wikimedia.org/T108850) [15:31:07] (03PS1) 10RobH: decommission of ms-be2001 through ms-be2012 [dns] - 10https://gerrit.wikimedia.org/r/358618 [15:31:42] 10Operations, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#3345069 (10BBlack) No I don't think we need it for non-immediate analysis like this. We still `zero`, `zeronet` and `proxy` in the X-Analytics... [15:31:53] PROBLEM - MegaRAID on analytics1063 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [15:32:03] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:32:12] 10Operations, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#3345072 (10Ottomata) Ok! Will merge this today then, thanks. [15:32:13] PROBLEM - puppet last run on ms-be2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:32:13] elukey: ^ that is because the BBU is doing a learning cycle - because it is enabled [15:32:13] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:32:13] PROBLEM - puppet last run on ms-be2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:32:23] PROBLEM - puppet last run on ms-be2011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:32:57] that's my fault [15:33:02] i forgot to put them in maint, just did [15:33:16] all the ms-be2001 through ms-be2012 are being decommissioned [15:33:31] well, i thought they were in maint from earlier steps, but i should have checked ;] [15:34:05] elukey: And now 1062…I guess all of them will start the learning cycle if they were racked around the same time [15:34:25] I will create a task so you guys can decide if you are fine with that or not [15:35:36] (03PS4) 10Ema: VCL: rate limit text-frontend requests [puppet] - 10https://gerrit.wikimedia.org/r/358583 (https://phabricator.wikimedia.org/T163233) [15:35:43] (03CR) 10Ema: [V: 032 C: 032] VCL: rate limit text-frontend requests [puppet] - 10https://gerrit.wikimedia.org/r/358583 (https://phabricator.wikimedia.org/T163233) (owner: 10Ema) [15:36:00] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3345077 (10Cmjohnson) [15:36:24] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3276700 (10Cmjohnson) Racked in A6 and B7 [15:36:51] there's been a spike of upload 503s after I moved swift_thumbs to codfw, subsiding now tho [15:40:09] 10Operations, 10Analytics: New analytic hosts with BBU learning cycle enabled - https://phabricator.wikimedia.org/T167809#3345083 (10Marostegui) [15:40:23] PROBLEM - MegaRAID on analytics1062 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [15:40:35] (03CR) 10RobH: [C: 032] decommission of ms-be2001 through ms-be2012 [dns] - 10https://gerrit.wikimedia.org/r/358618 (owner: 10RobH)
[15:40:49] 10Operations, 10ops-eqiad, 10Analytics: analytics1067: Broken BBU - https://phabricator.wikimedia.org/T167797#3344626 (10Marostegui) Looks like when the server is recharging it might not show the correct status of the BBU, looks like this wasn't broken, just started an Auto-Learn cycle: T167809 [15:41:10] 10Operations, 10Analytics: New analytic hosts with BBU learning cycle enabled - https://phabricator.wikimedia.org/T167809#3345083 (10Marostegui) p:05Triage>03Normal [15:41:18] !log restart of relforge1001 to test https://gerrit.wikimedia.org/r/#/c/358353/ [15:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:35] !log upload apertium-cat_2.1.0~r78615-1+wmf to apt.wikimedia.org/jessie-wikimedia/main [15:41:35] !log upload apertium-fra_1.1.0~r78695-1+wmf to apt.wikimedia.org/jessie-wikimedia/main [15:41:35] !log upload apertium-ita_0.9.0~r78828-1+wmf to apt.wikimedia.org/jessie-wikimedia/main [15:41:35] !log upload apertium-spa_1.0.0~r78827-1+wmf to apt.wikimedia.org/jessie-wikimedia/main [15:41:37] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, 10User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3345119 (10RobH) [15:41:37] (03PS4) 10Ema: VCL: use resp.reason for synthetic responses generation [puppet] - 10https://gerrit.wikimedia.org/r/358057 [15:41:42] 10Operations, 10Analytics: New analytic hosts with BBU learning cycle enabled - https://phabricator.wikimedia.org/T167809#3345083 (10Marostegui) [15:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:45] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, 10User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3174562 (10RobH) switch port details: asw-a-codfw: ge-1/0/1 ms-be2001 ge-3/0/40 ms-be2002 ge-4/0/40 ms-be2003 ge-5/0/18 ms-be2004 asw-...
[15:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:58] 10Operations, 10Analytics, 10User-Elukey: New analytic hosts with BBU learning cycle enabled - https://phabricator.wikimedia.org/T167809#3345126 (10elukey) [15:43:31] (03CR) 10Gehel: "This is looking good on relforge1001" [puppet] - 10https://gerrit.wikimedia.org/r/358353 (owner: 10Gehel) [15:44:13] 10Operations, 10ops-eqiad, 10Analytics-Kanban: analytics1030 stuck in console while booting - https://phabricator.wikimedia.org/T162046#3345131 (10Cmjohnson) 05Resolved>03Open The bbu has been replaced and the system board that was replaced needs to be swapped again. The service tag that was burned in w... [15:47:14] (03PS1) 10Filippo Giunchedi: hieradata: a/a for swift [puppet] - 10https://gerrit.wikimedia.org/r/358620 [15:47:16] (03PS1) 10Filippo Giunchedi: hieradata: move swift codfw to passive [puppet] - 10https://gerrit.wikimedia.org/r/358621 [15:47:18] (03PS1) 10Filippo Giunchedi: hieradata: point esams to swift eqiad [puppet] - 10https://gerrit.wikimedia.org/r/358622 [15:47:45] (03CR) 10Filippo Giunchedi: [C: 04-1] "do not merge yet" [puppet] - 10https://gerrit.wikimedia.org/r/358620 (owner: 10Filippo Giunchedi) [15:47:49] (03CR) 10Filippo Giunchedi: [C: 04-1] "do not merge yet" [puppet] - 10https://gerrit.wikimedia.org/r/358621 (owner: 10Filippo Giunchedi) [15:47:53] (03CR) 10Filippo Giunchedi: [C: 04-1] "do not merge yet" [puppet] - 10https://gerrit.wikimedia.org/r/358622 (owner: 10Filippo Giunchedi) [15:48:02] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, 10User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3345155 (10RobH) a:05RobH>03Papaul [15:48:38] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, 
10User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3174562 (10RobH) Ok, this is now ready for all disks to be wiped, and then removed from the racks for decommission. [15:52:43] (03PS1) 10EBernhardson: Test elastic2020 does not fall out of cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358625 [15:54:34] (03CR) 10Alexandros Kosiaris: "reckeck" [debs/contenttranslation/apertium-spa-ita] - 10https://gerrit.wikimedia.org/r/358570 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry) [15:54:36] (03CR) 10jerkins-bot: [V: 04-1] Test elastic2020 does not fall out of cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358625 (owner: 10EBernhardson) [15:54:40] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/358517 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry) [15:56:37] (03CR) 10DCausse: elasticsearch: use $facts['ipaddress'] as the published host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/358353 (owner: 10Gehel) [15:57:12] (03PS11) 10BBlack: numa_networking: support NUMA in interface::rps [puppet] - 10https://gerrit.wikimedia.org/r/355810 [15:57:14] (03PS11) 10BBlack: numa_networking: support NUMA in tlsproxy nginx config [puppet] - 10https://gerrit.wikimedia.org/r/355811 [15:57:16] (03PS4) 10BBlack: numa_networking: test enable on cp4021 [puppet] - 10https://gerrit.wikimedia.org/r/357844 [15:57:18] (03PS4) 10BBlack: numa_networking: remove install-time bnx2x stuff [puppet] - 10https://gerrit.wikimedia.org/r/357850 [15:58:11] (03PS1) 10Alexandros Kosiaris: Fix kubestagetcd100X PTRs [dns] - 10https://gerrit.wikimedia.org/r/358628 [15:58:48] good god that name [15:59:04] ahahahahah [15:59:35] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, just a nit" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/349352 (owner: 10Krinkle) [15:59:44] i hope someone put that on the naming 
standards page [15:59:51] it looks out of date, heh [16:00:03] * robh has to add the labtest stuff he installed last week [16:00:05] godog and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170613T1600). [16:00:05] Amir1 and Krinkle: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:12] o/ [16:00:25] (03PS6) 10Krinkle: mwgrep: Add --etitle option [puppet] - 10https://gerrit.wikimedia.org/r/349352 [16:00:40] o/ [16:00:55] Amir1: hi, has https://gerrit.wikimedia.org/r/#/c/357985/ been tested e.g. in beta? [16:01:13] Krinkle: hi, I'll merge your patches first since they are easier [16:01:26] (03Draft1) 10Paladox: irc echo: Convert package from python-irclib to python-irc [puppet] - 10https://gerrit.wikimedia.org/r/358626 [16:01:29] (03PS2) 10Paladox: irc echo: Convert package from python-irclib to python-irc [puppet] - 10https://gerrit.wikimedia.org/r/358626 [16:01:38] godog: nope, I can do it fast [16:01:49] Amir1: yes please! 
[16:01:55] (03PS4) 10Filippo Giunchedi: mwgrep: If --title is set, don't also require '*.js/.css' [puppet] - 10https://gerrit.wikimedia.org/r/349351 (owner: 10Krinkle) [16:02:58] on it [16:03:06] (03CR) 10Krinkle: irc echo: Convert package from python-irclib to python-irc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/358626 (owner: 10Paladox) [16:04:02] (03PS3) 10Paladox: irc echo: Convert package from python-irclib to python-irc [puppet] - 10https://gerrit.wikimedia.org/r/358626 [16:04:06] (03CR) 10Paladox: irc echo: Convert package from python-irclib to python-irc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/358626 (owner: 10Paladox) [16:04:29] (03CR) 10Filippo Giunchedi: [C: 032] mwgrep: If --title is set, don't also require '*.js/.css' [puppet] - 10https://gerrit.wikimedia.org/r/349351 (owner: 10Krinkle) [16:04:37] (03PS7) 10Filippo Giunchedi: mwgrep: Add --etitle option [puppet] - 10https://gerrit.wikimedia.org/r/349352 (owner: 10Krinkle) [16:04:55] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] mwgrep: Add --etitle option [puppet] - 10https://gerrit.wikimedia.org/r/349352 (owner: 10Krinkle) [16:04:57] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:05:19] (03CR) 10DCausse: [C: 031] elasticsearch: use $facts['ipaddress'] as the published host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/358353 (owner: 10Gehel) [16:05:23] Krinkle: ^ merged both [16:05:26] godog: thx [16:05:36] godog: Can you apply on terbium for me to verify? [16:05:37] (03CR) 10Alexandros Kosiaris: [C: 032] Fix kubestagetcd100X PTRs [dns] - 10https://gerrit.wikimedia.org/r/358628 (owner: 10Alexandros Kosiaris) [16:05:57] godog: cherry-picked in the puppetmaster, do you know where is only puppet agent on mediawiki nodes and restarting apache is enough or I should do something else? [16:06:11] like in another node (do they have loadbalancer?) 
[16:06:32] Krinkle: sure, {{done}} [16:07:09] (03PS2) 10Gehel: elasticsearch: use $facts['ipaddress'] as the published host [puppet] - 10https://gerrit.wikimedia.org/r/358353 [16:07:14] Amir1: yeah after puppet and apache graceful if needed then should be live already, what host have you tried on btw? [16:07:27] PROBLEM - MegaRAID on analytics1061 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [16:07:29] mediawiki nodes [16:07:45] godog: verified. thanks [16:08:30] !log installing libnl security updates on trusty [16:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:39] Amir1: ack, which hostname though? I wanted to take a look at what puppet did and if I'll need to do apache graceful [16:08:59] I'm doing to deployment-mediawiki05 right now [16:09:14] puppet agent is rewriting the file [16:10:12] ok thanks! I'll do 04 [16:10:18] RECOVERY - MegaRAID on analytics1062 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [16:10:45] 10Operations, 10HHVM: Nutcracker doesn't start at boot - https://phabricator.wikimedia.org/T163795#3345200 (10MoritzMuehlenhoff) 05Open>03Resolved Fixed package is fully rolled out now. 
[16:11:48] RECOVERY - MegaRAID on analytics1063 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [16:12:39] Amir1: yeah looks like apache2ctl graceful is needed in this case [16:12:55] (03CR) 10Paladox: [C: 04-1] "Sending too many messages results in" [puppet] - 10https://gerrit.wikimedia.org/r/358626 (owner: 10Paladox) [16:13:04] godog: oh it seems these apache rules are not being applied in beta cluster [16:13:05] https://wikidata.beta.wmflabs.org/entity/Q7251 [16:13:08] :/ [16:14:03] Amir1: indeed, 404s for me [16:14:40] comparing https://wikidata.org/entity/Q7251 [16:14:51] so everything we do is not needed [16:15:37] what do you mean ? [16:16:01] I mean it's not possible to test in beta cluster [16:16:18] as the file is not being used to rewrite paths [16:16:27] otherwise it would redirect and not 404 [16:16:57] indeed, I'm trying to understand why that's the case, afaik the wikidata redirects are supposed to work in beta too ? [16:17:22] probably no one cared enough to make it happen [16:17:29] (03CR) 10DCausse: [C: 031] "elastic 5.3.2 is now running on production with these plugins, are we good to go?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350312 (https://phabricator.wikimedia.org/T163829) (owner: 10Tjones) [16:18:59] quite possibly, checking what's up [16:20:24] thanks [16:22:12] 10Operations, 10Citoid, 10VisualEditor, 10Services (blocked), 10User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3345250 (10czar) @Samwalton9 are you in contact with Wiley [[ https://en.wikipedia.org/wiki/Wikipedia:The_Wi... [16:22:22] (03PS4) 10Paladox: irc echo: Convert package from python-irclib to python-irc [puppet] - 10https://gerrit.wikimedia.org/r/358626 [16:23:02] (03CR) 10Paladox: "Fixed the error now. 
Tested with doing echo "testing" >> /var/log/icinga2/irc.log multiple times until the bot quit due to excess flood an" [puppet] - 10https://gerrit.wikimedia.org/r/358626 (owner: 10Paladox) [16:24:37] (03PS1) 10Filippo Giunchedi: mediawiki: match beta wikidata with production includes [puppet] - 10https://gerrit.wikimedia.org/r/358631 (https://phabricator.wikimedia.org/T119536) [16:24:43] Amir1: ^ [16:25:01] I don't see anything, muted wikibugs :D [16:25:59] hehe fair enough, I turn bots messages into NOTICE or can't tell bots from humans [16:26:32] Amir1: https://gerrit.wikimedia.org/r/358631 [16:26:57] (03PS2) 10EBernhardson: Test elastic2020 does not fall out of cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358625 [16:27:10] (03CR) 10Ladsgroup: [C: 031] mediawiki: match beta wikidata with production includes [puppet] - 10https://gerrit.wikimedia.org/r/358631 (https://phabricator.wikimedia.org/T119536) (owner: 10Filippo Giunchedi) [16:27:27] godog: thanks +1'd [16:27:40] I can cherry-pick it there too [16:28:09] (03CR) 10Filippo Giunchedi: [C: 032] mediawiki: match beta wikidata with production includes [puppet] - 10https://gerrit.wikimedia.org/r/358631 (https://phabricator.wikimedia.org/T119536) (owner: 10Filippo Giunchedi) [16:28:11] (03CR) 10EBernhardson: "we should be migrating to 5.3.2 really soon™" [puppet] - 10https://gerrit.wikimedia.org/r/356900 (owner: 10Hashar) [16:28:40] Amir1: no worries, just merged now so you can rebase beta puppetmaster [16:28:57] that's easier :D [16:29:54] rebased [16:30:23] running puppet agent -tv on mediawiki05 [16:31:48] PROBLEM - SSH on ms-be1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:31:57] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:32:29] (03PS24) 10Gehel: maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [16:32:37] RECOVERY - SSH on ms-be1019 is OK: SSH OK - OpenSSH_6.6.1p1 
Ubuntu-2ubuntu2.8 (protocol 2.0) [16:32:56] Amir1: LGTM on 04 [16:33:29] awesome [16:33:37] Shall we move on to prod? [16:35:13] !log upgrading osmium to HHVM 3.18 [16:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:26] Amir1: err I meant the change itself looks good, I'm testing it now [16:35:39] oh okay. [16:35:42] Thanks [16:37:22] Amir1: I'm still getting a 302 for the version without a dot [16:37:27] RECOVERY - MegaRAID on analytics1061 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [16:37:33] curl -vL https://wikidata.beta.wmflabs.org/entity/Q7251 2>&1 > /dev/null | grep -e HTTP -e Location [16:37:35] godog: you should [16:37:40] but only one [16:37:42] not two [16:37:50] (03CR) 10EBernhardson: "hmm, i could probably write a quick script on terbium to find out all the wikis with changed settings from this patch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350312 (https://phabricator.wikimedia.org/T163829) (owner: 10Tjones) [16:38:49] (03CR) 10Nemo bis: "Hi, thanks for your patch. Was there any discussion about this? I only see a mention at https://phabricator.wikimedia.org/T74863#1712104" [puppet] - 10https://gerrit.wikimedia.org/r/306892 (owner: 10Racodond) [16:38:56] godog: the first chain is https://wikidata.beta.wmflabs.org/entity/Q7251 to https://wikidata.beta.wmflabs.org/wiki/Special:EntityPage/Q7251 and then to https://wikidata.beta.wmflabs.org/wiki/Q7251 [16:39:05] this should make the first redirect internal [16:39:16] so user only sees one redirect [16:39:20] yeah, I'm getting 302 -> 303 -> 200 for the chain above ATM [16:39:21] 10Operations, 10ops-codfw: ms-be2002.codfw.wmnet has drac issues - https://phabricator.wikimedia.org/T155689#3345296 (10Papaul) 05Open>03Resolved This is one of the system that will be decommission in T162785 so closing this task. 
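(Annotation) The 302 -> 303 -> 200 chain reported above can be tallied from the filtered `curl -vL ... | grep -e HTTP -e Location` output. The following is a minimal sketch, assuming header lines in curl's verbose format; the helper names `summarize_chain` and `external_redirects` are hypothetical and not part of any tool used in the channel.

```python
import re

# curl -v prefixes response headers with "< "; the prefix is optional here
# so that output from wget -S style tools can be fed in as well.
STATUS_RE = re.compile(r"^<?\s*HTTP/\d(?:\.\d)?\s+(\d{3})")
LOCATION_RE = re.compile(r"^<?\s*location:\s*(\S+)", re.IGNORECASE)


def summarize_chain(lines):
    """Return a list of (status_code, location_or_None), one per response."""
    hops = []
    for raw in lines:
        line = raw.strip()
        status = STATUS_RE.match(line)
        if status:
            hops.append([int(status.group(1)), None])
            continue
        location = LOCATION_RE.match(line)
        if location and hops:
            # Attach the Location header to the most recent status line.
            hops[-1][1] = location.group(1)
    return [tuple(hop) for hop in hops]


def external_redirects(hops):
    """Count the 3xx responses a client actually sees."""
    return sum(1 for status, _ in hops if 300 <= status < 400)
```

With the chain described in the log, `external_redirects` would report 2; the goal of the patch under discussion is to bring that down to 1 by making the first hop an internal rewrite.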
[16:39:23] 10Operations, 10Patch-For-Review: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3345300 (10Papaul) [16:39:27] do you see the same? [16:39:45] godog: I don't know how to test it on the mediawiki05 host [16:39:54] we might be hitting the wrong ndoe [16:39:56] *node [16:40:03] Amir1: I did it externally with curl -vL https://wikidata.beta.wmflabs.org/entity/Q7251 2>&1 > /dev/null | grep -e HTTP -e Location [16:40:08] (03PS2) 10Gehel: maps - renamed cassandra passwords for role / profile refactoring [labs/private] - 10https://gerrit.wikimedia.org/r/353068 [16:40:19] both nodes should be ok now, I mean earlier it wasn't even working [16:40:27] (03CR) 10Gehel: [V: 032 C: 032] maps - renamed cassandra passwords for role / profile refactoring [labs/private] - 10https://gerrit.wikimedia.org/r/353068 (owner: 10Gehel) [16:42:15] oh, I get it now, I thought there are five nodes [16:42:22] so sorry. [16:42:37] np, you can test locally too with sth like this [16:42:41] curl -v -H 'Host: wikidata.beta.wmflabs.org' -H 'x-forwarded-proto: https' localhost:80/entity/Q7251 [16:43:12] ditto, 302 on that to Special:EntityData [16:45:11] that's unexpected [16:45:28] okay. I will re-read apache configs [16:45:44] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-spa-ita] - 10https://gerrit.wikimedia.org/r/358570 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry) [16:45:59] ok! 
should be easier now that's testable in beta [16:46:13] yeah [16:46:25] I'm guessing the PT flag is missing https://httpd.apache.org/docs/2.4/rewrite/remapping.html [16:46:36] but my apache knowledge is not great [16:46:38] so I just try [16:47:08] 10Operations, 10ops-codfw, 10DC-Ops, 10Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3140522 (10Gehel) Actually, we are still going to do a last test of switching traffic from eqiad to codfw and see if that server crashes or not. [16:47:28] Amir1: we have a local apache httpd resident expert, elukey ! [16:49:25] It would be great! [16:49:41] (03CR) 10Tjones: "Based on https://www.mediawiki.org/wiki/Special:SiteMatrix, it should be zh/Chinese-language projects: Chinese wikipedia, wiktionary, wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350312 (https://phabricator.wikimedia.org/T163829) (owner: 10Tjones) [16:49:49] (03PS5) 10Tjones: Enable BM25 for Chinese wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350312 (https://phabricator.wikimedia.org/T163829) [16:50:10] Amir1: I have to run, I'm assuming the patch will be up for another puppet swat [16:50:32] yeah, can I ping you again in non-swat time? [16:51:05] I'd prefer if we keep it to swat since avoiding interruptions is part of the reason of puppet swat [16:51:27] (03CR) 10Tjones: "And I'm assuming T166722 / I8b5dd2ac974e3e6fed92c70a5992a6d7b7a9b852 has been or will be deployed before any indexing happens." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350312 (https://phabricator.wikimedia.org/T163829) (owner: 10Tjones) [16:51:36] godog: that's a good one [16:51:51] but if you are unsure about sth about the patch and it'd help move it along sure Amir1 [16:52:03] elukey: I thought it was true!! 
[16:52:25] Thanks [16:52:32] Amir1: I didn't follow all but if you want I can help [16:53:07] elukey: the thing is that we want to have an internal redirect instead of sending out a 302 [16:53:12] (03PS1) 10Gehel: maps - refactoring to role/profile: parameter renaming [labs/private] - 10https://gerrit.wikimedia.org/r/358633 [16:53:30] Amir1: internal redirect == rewrite right? [16:53:41] yeah [16:53:44] super [16:54:53] elukey: https://gerrit.wikimedia.org/r/#/c/357985/ [16:55:17] the patch is this which I used this manual https://httpd.apache.org/docs/2.4/rewrite/remapping.html but it seems it's not working [16:55:26] you can test it in beta cluster [16:57:04] what happens? [16:57:23] it still sends out 302 [16:57:50] https://www.irccloud.com/pastebin/WKoi4onM/ [16:58:04] (03CR) 10Gehel: [V: 032 C: 032] maps - refactoring to role/profile: parameter renaming [labs/private] - 10https://gerrit.wikimedia.org/r/358633 (owner: 10Gehel) [16:58:55] elukey: or externally: wget -S https://wikidata.beta.wmflabs.org/entity/Q36661 2>&1 | egrep '^ *HTTP|^ *Location' [16:59:11] Amir1: are you sure that the config gets applied in there? [16:59:22] yes [16:59:32] I mean, that beta shares the same with prod [16:59:45] because at first it sent 404 and we just fixed it [16:59:54] one thing that we could do is set the rewrite module logs to trace [16:59:57] and see what it does [17:00:03] let me hack mediawiki05 [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170613T1700). [17:00:15] thanks [17:00:25] I have a deploy for ores [17:00:33] no parsoid deploy today [17:00:51] https://gerrit.wikimedia.org/r/#/c/358631/ [17:01:35] \o/ [17:01:52] Amir1, did you check on beta. Anything crazy that might make us not want to deploy? [17:02:09] halfak: did you deploy it there? 
[17:02:14] yup [17:02:25] awesome, let me check that [17:02:40] it's not exploded [17:02:44] https://ores-beta.wmflabs.org/v2/scores/wikidatawiki/?models=damaging&revids=421063984 [17:03:09] https://ores-beta.wmflabs.org/v3/scores/frwikisource/235435 [17:03:20] nice job halfak [17:03:30] :D [17:03:38] it seems okay to me to go to prod [17:03:46] Good to go then. Are you going to do the deploy? [17:03:59] if that's okay for you [17:06:02] (03CR) 10Gehel: "No, there wasn't any discussion that I know of, except in person between David (Racodond) and myself. This started as an experiment to see" [puppet] - 10https://gerrit.wikimedia.org/r/306892 (owner: 10Racodond) [17:06:11] Amir1, rock on [17:06:13] I'm on standby [17:06:19] thanks [17:06:24] jouncebot: next [17:06:24] In 1 hour(s) and 53 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170613T1900) [17:06:34] !log rebooting sca2003 for tests [17:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:51] halfak: just to confirm, everything in gerrit is ready and no more patch is needed, right? [17:08:55] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestneutron2002 - https://phabricator.wikimedia.org/T167160#3345372 (10Papaul) @RobH can we check please partman recipe for this system and labtestnet2002. I am stuck at partition during install, Please se... [17:08:57] PROBLEM - Host sca2003 is DOWN: PING CRITICAL - Packet loss = 100% [17:09:25] Amir1, I believe that is right. [17:09:29] That's what is in beta now. 
[17:09:35] awesome [17:10:17] RECOVERY - Host sca2003 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [17:11:07] (03PS3) 10Reedy: Promote CollaborationKit to the big leagues; deploy on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343697 (https://phabricator.wikimedia.org/T138326) [17:11:09] (03PS1) 10Reedy: Move CollaborationKit i18n to non labs extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358638 (https://phabricator.wikimedia.org/T138326) [17:11:09] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestneutron2002 - https://phabricator.wikimedia.org/T167160#3345391 (10RobH) a:05Papaul>03RobH [17:11:44] !log ladsgroup@tin Started deploy [ores/deploy@862aea9]: ORES deploy early June: T167223 [17:11:45] (03CR) 10Alexandros Kosiaris: [C: 031] "LGTM, I 've even tested it live on tegmen. That being said I am wary of merging it right now, I 'll be merging it tomorrow when I will be " [puppet] - 10https://gerrit.wikimedia.org/r/358626 (owner: 10Paladox) [17:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:53] T167223: Early June ORES prod deploy - https://phabricator.wikimedia.org/T167223 [17:12:12] * Reedy kicks wikibugs [17:12:26] (03CR) 10Paladox: "Thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/358626 (owner: 10Paladox) [17:14:32] (03PS1) 10Reedy: Remove duplicate config from CommonSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358639 [17:15:18] okay, canary is up, let's wait to see alarms [17:16:29] K-lined? really? 
[17:19:50] That's better [17:19:52] jouncebot: next [17:19:52] In 1 hour(s) and 40 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170613T1900) [17:19:55] jouncebot: now [17:19:55] For the next 0 hour(s) and 40 minute(s): Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170613T1700) [17:20:32] Amir1, which is our canary again? [17:20:37] halfak: this doesn't look good https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1 [17:20:38] 1002 [17:21:04] Looks good [17:21:15] halfak: it's not related to our deploy but we had down time today [17:21:43] 10Operations, 10Operations-Software-Development, 10monitoring, 10Patch-For-Review: Monitoring: create an alert for daemonized puppet - https://phabricator.wikimedia.org/T166371#3345415 (10Dzahn) 05Open>03Resolved Yep, i was hoping for this to be the outcome. I'll call it resolved then. :) [17:22:12] WTF [17:22:23] akosiaris, did you see anything about this? [17:22:30] Amir1, did we have any alarms go off? [17:22:45] halfak: none for me [17:22:55] Amir1, I did the beta deploy at 1600 UTC [17:23:55] this is for prod, it is not relying on beta cluster anyhow [17:24:13] Weird. [17:24:48] We do see a spike in errors on ORES too [17:24:53] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: replace es-tool with elasticsearch-curator for standard elasticsearch operations - https://phabricator.wikimedia.org/T166154#3345423 (10Gehel) logstash scripts are now using curator, some standard action files (enabling / disabling shard al... 
[17:24:56] Not timing out [17:25:20] Only affected eqiad [17:25:38] (03PS1) 10Chad: group0 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358641 [17:26:03] halfak: logstash seems fine [17:26:15] (03PS1) 10Chad: Scap clean: Move message handling around since we use it for lock files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358642 [17:26:17] Amir1, can you find out what errors mediawiki was getting? [17:26:19] maybe redis/jobqueue is having trouble [17:26:36] yeah, sure [17:27:10] time out [17:27:13] Error rate is still high for eqiad. [17:27:26] halfak: https://logstash.wikimedia.org/goto/d1bba2b66c8381de3e8b106d6e59cebc [17:28:20] halfak: large queue? someone is putting pressure? [17:28:35] I can increase the timeout time for now [17:28:45] Amir1, I don't think we should [17:28:59] Doesn't look like that. I just checked on a couple of those scores and could regenerate fast [17:29:04] E.g. https://ores.wikimedia.org/v3/scores/wikidatawiki/?models=damaging%7Cgoodfaith&revids=500498629&features&format=json [17:29:09] (03CR) 10Chad: [C: 032] Scap clean: Move message handling around since we use it for lock files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358642 (owner: 10Chad) [17:29:16] By adding "features" to the query params you force the score to regenerate [17:29:29] can you get a time? [17:29:37] PROBLEM - Check Varnish expiry mailbox lag on cp2002 is CRITICAL: CRITICAL: expiry mailbox lag is 2047052 [17:30:03] for mediawiki I'm guessing it's 1 second or so, I don't quite remember [17:30:18] (03CR) 10Nemo bis: "Yeah. 
I think a sensible first step, for anybody interested in this, would be to install a sonarqube instance somewhere and produce some o" [puppet] - 10https://gerrit.wikimedia.org/r/306892 (owner: 10Racodond) [17:30:20] (03Merged) 10jenkins-bot: Scap clean: Move message handling around since we use it for lock files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358642 (owner: 10Chad) [17:30:30] (03CR) 10jenkins-bot: Scap clean: Move message handling around since we use it for lock files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358642 (owner: 10Chad) [17:30:32] 1.45s [17:30:49] Avg for 10 requests for that specific score [17:30:53] 10Operations, 10Deployment-Systems: Have fallback communication channel when freenode has problems - https://phabricator.wikimedia.org/T127904#3345441 (10Dzahn) We could just agree on something like "if freenode is down we all switch to efnet, same channel names" and be done with it. vs. installing our own irc... [17:31:14] 1.45s is expected behavior [17:31:15] nice [17:31:19] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: replace es-tool with elasticsearch-curator for standard elasticsearch operations - https://phabricator.wikimedia.org/T166154#3345445 (10debt) 05Open>03Resolved Thanks, @Gehel ! [17:33:45] (03PS1) 10Chad: scap clean: We need --force defined, but we don't want people to use it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358643 [17:36:38] Amir1, I think we should move forward with this deploy [17:36:44] Whatever this is, it's not related. [17:36:56] yeah, I already pushed the deploy [17:37:33] ok cool [17:37:42] (03CR) 10Chad: [C: 032] scap clean: We need --force defined, but we don't want people to use it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358643 (owner: 10Chad) [17:38:17] halfak: Can you make a phab card? 
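(Annotation) The check described above, forcing a score to regenerate by adding `features` to the query and averaging over 10 requests, could be scripted roughly as follows. This is a sketch under assumptions: `score_url` and `average_seconds` are hypothetical helper names, the URL layout is copied from the example link pasted in the channel, and the `fetch` callable is left to the caller so that nothing here hits the network on its own.

```python
import time
from urllib.parse import quote


def score_url(wiki, rev_id, models, force=True):
    """Build a v3 scores URL shaped like the one pasted in the channel."""
    query = [f"models={quote('|'.join(models))}", f"revids={int(rev_id)}"]
    if force:
        # Per the discussion, a bare "features" parameter forces the
        # score to be regenerated.
        query.append("features")
    query.append("format=json")
    return f"https://ores.wikimedia.org/v3/scores/{wiki}/?" + "&".join(query)


def average_seconds(fetch, n=10, clock=time.perf_counter):
    """Call fetch() n times and return the mean wall-clock duration."""
    total = 0.0
    for _ in range(n):
        start = clock()
        fetch()
        total += clock() - start
    return total / n


# Example wiring (not executed here):
#   import urllib.request
#   url = score_url("wikidatawiki", 500498629, ["damaging", "goodfaith"])
#   print(average_seconds(lambda: urllib.request.urlopen(url).read()))
```

Passing the clock in as a parameter keeps the timing logic checkable without real requests.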
[17:38:22] (03CR) 10Chad: [C: 032] Move CollaborationKit i18n to non labs extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358638 (https://phabricator.wikimedia.org/T138326) (owner: 10Reedy) [17:38:23] with UBN! status [17:38:43] (03Merged) 10jenkins-bot: scap clean: We need --force defined, but we don't want people to use it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358643 (owner: 10Chad) [17:38:55] (03CR) 10jenkins-bot: scap clean: We need --force defined, but we don't want people to use it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358643 (owner: 10Chad) [17:39:28] !log restart varnish-be on cp2002 (mailbox lag, likely induced by swift traffic testing in codfw) [17:39:35] (03Merged) 10jenkins-bot: Move CollaborationKit i18n to non labs extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358638 (https://phabricator.wikimedia.org/T138326) (owner: 10Reedy) [17:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:13] https://phabricator.wikimedia.org/T167819 [17:40:17] Amir1, ^ [17:40:19] !log demon@tin Pruned MediaWiki: 1.30.0-wmf.1 [keeping static files] (duration: 05m 10s) [17:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:59] (03CR) 10jenkins-bot: Move CollaborationKit i18n to non labs extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358638 (https://phabricator.wikimedia.org/T138326) (owner: 10Reedy) [17:41:54] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3345502 (10Benoit_Rochon) Just a little question. Is this task is planned for the... [17:42:01] Memory usage for EQIAD is much higher than CODFW. 
[17:42:08] (not us -- just on the machines generally) [17:42:36] !log demon@tin Pruned MediaWiki: 1.30.0-wmf.2 [keeping static files] (duration: 01m 13s) [17:42:40] Probably because it's doing something? ;) [17:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:15] Amir1, has deploy finished yet? [17:43:37] 77% [17:44:33] kk [17:45:07] halfak: Something definitely happened on 1600 UTC [17:45:07] https://grafana.wikimedia.org/dashboard/db/ores?panelId=13&fullscreen&orgId=1&from=1497364774089&to=1497375574089 [17:45:08] from scb1004 "ores.wsgi.util.ParamError: Could not interpret revids. invalid literal for int() with base 10: 'last'" [17:45:13] DDoS? [17:45:17] !log demon@tin Started scap: testwiki to wmf.5 + l10n bootstrap [17:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:36] !log ladsgroup@tin Finished deploy [ores/deploy@862aea9]: ORES deploy early June: T167223 (duration: 33m 52s) [17:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:45] T167223: Early June ORES prod deploy - https://phabricator.wikimedia.org/T167223 [17:45:48] "(duration: 33m 52s)" [17:45:49] fun [17:45:56] (03PS12) 10BBlack: numa_networking: support NUMA in interface::rps [puppet] - 10https://gerrit.wikimedia.org/r/355810 [17:45:58] (03PS12) 10BBlack: numa_networking: support NUMA in tlsproxy nginx config [puppet] - 10https://gerrit.wikimedia.org/r/355811 [17:45:59] Yeah. 
This is a param error [17:46:00] (03PS5) 10BBlack: numa_networking: test enable on cp4021 [puppet] - 10https://gerrit.wikimedia.org/r/357844 [17:46:02] (03PS5) 10BBlack: numa_networking: remove install-time bnx2x stuff [puppet] - 10https://gerrit.wikimedia.org/r/357850 [17:46:08] user error [17:46:26] I'm still keeping the old hash in case we want to revert [17:46:35] (03CR) 10BBlack: [V: 032 C: 032] "compiler no-op for existing hosts with no hiera $numa_networking turned on" [puppet] - 10https://gerrit.wikimedia.org/r/355810 (owner: 10BBlack) [17:46:43] See the full error here: https://phabricator.wikimedia.org/T167819#3345509 [17:47:00] (03CR) 10BBlack: [V: 032 C: 032] "Whitespace/comment diff for nginx.conf on existing hosts (ok)" [puppet] - 10https://gerrit.wikimedia.org/r/355811 (owner: 10BBlack) [17:47:07] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3345510 (10Reedy) >>! In T167714#3345502, @Benoit_Rochon wrote: > Just a little q... 
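(Annotation) The scb1004 error quoted above is what a plain `int()` conversion produces when a client sends `revids=last`. The sketch below reproduces that message shape; it is not the ORES source. `ParamError` here is a local stand-in class, `parse_revids` is a hypothetical name, and the pipe-separated format is an assumption.

```python
class ParamError(ValueError):
    """Local stand-in for a request-parameter error (hypothetical)."""


def parse_revids(raw):
    """Parse a pipe-separated list of revision IDs (format assumed)."""
    try:
        return [int(part) for part in raw.split("|")]
    except ValueError as exc:
        # Wrap the int() failure so the caller can treat it as a client
        # error rather than a server fault.
        raise ParamError(f"Could not interpret revids. {exc}") from exc
```

A request such as `revids=last` fails in the conversion step, which is consistent with the "user error" reading in the channel: the bad input comes from the caller, not from the deploy.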
[17:47:08] (03CR) 10BBlack: [V: 032 C: 032] numa_networking: test enable on cp4021 [puppet] - 10https://gerrit.wikimedia.org/r/357844 (owner: 10BBlack) [17:47:18] (03CR) 10BBlack: [V: 032 C: 032] numa_networking: remove install-time bnx2x stuff [puppet] - 10https://gerrit.wikimedia.org/r/357850 (owner: 10BBlack) [17:47:26] (03PS4) 10Ottomata: webrequest: remove X-Forwarded-For [puppet] - 10https://gerrit.wikimedia.org/r/253474 (https://phabricator.wikimedia.org/T118557) (owner: 10BBlack) [17:47:37] (03PS8) 10Nemo bis: Extend maximum allowed MediaWiki version to 1.26 [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/171976 (https://phabricator.wikimedia.org/T68661) (owner: 10Wpmirrordev) [17:47:43] (03CR) 10jerkins-bot: [V: 04-1] Extend maximum allowed MediaWiki version to 1.26 [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/171976 (https://phabricator.wikimedia.org/T68661) (owner: 10Wpmirrordev) [17:48:06] (03CR) 10Ottomata: [V: 032 C: 032] "Tested on cp1045 and with raw and refined webrequest." 
[puppet] - 10https://gerrit.wikimedia.org/r/253474 (https://phabricator.wikimedia.org/T118557) (owner: 10BBlack) [17:48:11] (03PS5) 10Ottomata: webrequest: remove X-Forwarded-For [puppet] - 10https://gerrit.wikimedia.org/r/253474 (https://phabricator.wikimedia.org/T118557) (owner: 10BBlack) [17:48:14] (03CR) 10Ottomata: [V: 032 C: 032] webrequest: remove X-Forwarded-For [puppet] - 10https://gerrit.wikimedia.org/r/253474 (https://phabricator.wikimedia.org/T118557) (owner: 10BBlack) [17:49:37] RECOVERY - Check Varnish expiry mailbox lag on cp2002 is OK: OK: expiry mailbox lag is 0 [17:50:06] !log merged removal of x_forwarded_for from all varnishkafka webrequest instances [17:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:27] !log cp4021 reboot for bnx2x modparam change [17:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:30] (03PS1) 10Dzahn: wikistats: fix group, cron for xml dump, no spam [puppet] - 10https://gerrit.wikimedia.org/r/358644 [17:54:25] (03PS1) 10Dzahn: rancid: add APT pin to jessie-backports release [puppet] - 10https://gerrit.wikimedia.org/r/358645 (https://phabricator.wikimedia.org/T159756) [17:55:38] jouncebot: next [17:55:38] In 1 hour(s) and 4 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170613T1900) [17:55:40] (03CR) 10Faidon Liambotis: "Sounds good on itself, but maybe we should just go for stretch for netmon since it's being reinstalled so close to the release date?" [puppet] - 10https://gerrit.wikimedia.org/r/358645 (https://phabricator.wikimedia.org/T159756) (owner: 10Dzahn) [17:55:47] mutante: ^ [17:56:48] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labweb100[12].wikimedia.org - https://phabricator.wikimedia.org/T167820#3345535 (10RobH) [17:57:57] paravoid: i was thinking about that .. yea.. should i reinstall netmon1002 with stretch. 
ok, i will try then :) [17:58:07] 4 days or so , heh [17:58:31] yeah :) [18:02:47] (03PS1) 10Dzahn: install_server: switch netmon1002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/358647 (https://phabricator.wikimedia.org/T159756) [18:02:51] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3345579 (10Benoit_Rochon) Ho, I see. So it is not sure that Wikipedia Atikamekw w... [18:05:04] (03PS2) 10Dzahn: install_server: switch netmon1002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/358647 (https://phabricator.wikimedia.org/T159756) [18:08:02] (03CR) 10Dzahn: [C: 032] install_server: switch netmon1002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/358647 (https://phabricator.wikimedia.org/T159756) (owner: 10Dzahn) [18:10:48] (03CR) 10Elukey: [C: 04-1] "I helped a bit Amir while testing the patch on labs, and this will not work as intended. I turned to trace8 the rewrite logs on deployment" [puppet] - 10https://gerrit.wikimedia.org/r/357985 (https://phabricator.wikimedia.org/T119536) (owner: 10Ladsgroup) [18:17:36] (03PS2) 10Herron: Change wikimedia.org SPF record to soft fail (~all) [dns] - 10https://gerrit.wikimedia.org/r/358132 (https://phabricator.wikimedia.org/T133191) [18:21:55] (03PS1) 10RobH: setting labtestnet2002 and labtestweb2002 partitioning [puppet] - 10https://gerrit.wikimedia.org/r/358717 [18:23:15] oh neat, two bug:task# lines successfully links to two tasks [18:23:26] things i never do but did cuz this was tiny change. 
[18:24:18] (03PS1) 10Framawiki: Create a FeaturedFeed for the RAW bulletin on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358772 (https://phabricator.wikimedia.org/T167617) [18:26:45] (03PS1) 10Ema: VCL: increase rate limit to 600/60s [puppet] - 10https://gerrit.wikimedia.org/r/358774 (https://phabricator.wikimedia.org/T163233) [18:27:31] !log demon@tin Finished scap: testwiki to wmf.5 + l10n bootstrap (duration: 42m 16s) [18:27:37] (03CR) 10Bearloga: "Test instance up and running on Labs thanks to @gehel: https://discovery-puppet-test.wmflabs.org just checked all the dashboards and they " [puppet] - 10https://gerrit.wikimedia.org/r/353571 (https://phabricator.wikimedia.org/T161354) (owner: 10Bearloga) [18:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:06] (03CR) 10BBlack: [C: 031] VCL: increase rate limit to 600/60s [puppet] - 10https://gerrit.wikimedia.org/r/358774 (https://phabricator.wikimedia.org/T163233) (owner: 10Ema) [18:28:48] (03CR) 10Ema: [V: 032 C: 032] VCL: increase rate limit to 600/60s [puppet] - 10https://gerrit.wikimedia.org/r/358774 (https://phabricator.wikimedia.org/T163233) (owner: 10Ema) [18:30:33] (03PS2) 10RobH: setting labtestnet2002 and labtestweb2002 partitioning [puppet] - 10https://gerrit.wikimedia.org/r/358717 [18:31:02] (03CR) 10RobH: [C: 032] setting labtestnet2002 and labtestweb2002 partitioning [puppet] - 10https://gerrit.wikimedia.org/r/358717 (owner: 10RobH) [18:31:37] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] [18:32:35] well that's a big queue https://integration.wikimedia.org/zuul/ [18:33:36] !log mobrovac@tin Started deploy [restbase/deploy@4c1cdd0]: (no justification provided) [18:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:18] PROBLEM - Restbase root url on 
restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused [18:36:52] known ^ [18:37:56] !log mobrovac@tin Finished deploy [restbase/deploy@4c1cdd0]: (no justification provided) (duration: 04m 19s) [18:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:27] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3345648 (10Cmjohnson) @fgiunchedi the bbu finally shipped. i will ping you once it arrives to swap Hewlett Packard Enterprise Reference Number: 5320104843 STATUS:... [18:39:06] !log mobrovac@tin Started deploy [restbase/deploy@4c1cdd0]: (no justification provided) [18:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:16] (03CR) 10Andrew Bogott: [C: 031] "This is fine with me, shouldn't make too big of a difference." [puppet] - 10https://gerrit.wikimedia.org/r/358601 (https://phabricator.wikimedia.org/T167803) (owner: 10Hashar) [18:42:27] PROBLEM - Restbase root url on restbase2005 is CRITICAL: connect to address 10.192.48.37 and port 7231: Connection refused [18:43:45] !log mobrovac@tin Started deploy [restbase/deploy@4c1cdd0]: (no justification provided) [18:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:38] PROBLEM - Restbase root url on restbase1010 is CRITICAL: connect to address 10.64.0.112 and port 7231: Connection refused [18:46:47] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031#3345698 (10greg) >>! In T167031#3344164, @MarcoAurelio wrote: > @greg Can we get a one hour window on wikitech:Deployments in coordination with @jcrespo and @Marost... [18:47:17] anomie: for the huge # of patches you send, can you please push them once the tests have passed instead of CR+2 ? 
[18:47:25] anomie: or that is going to kill CI for the rest of the day [18:48:14] (03CR) 10EBernhardson: "A quick check of all wikis in prod that report $wgLanguageCode == 'zh':" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350312 (https://phabricator.wikimedia.org/T163829) (owner: 10Tjones) [18:48:21] !log mobrovac@tin Finished deploy [restbase/deploy@4c1cdd0]: (no justification provided) (duration: 04m 36s) [18:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:33] ACKNOWLEDGEMENT - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [140.0] amusso Unusual amount of patches sent to mediawiki/extensions [18:48:35] hashar: Are you trying to tell me the same thing RainbowSprinkles did in #mediawiki-core about pushing with +2? [18:49:02] (03PS1) 10Cmjohnson: Adding mac address for analytics1069 T162216 [puppet] - 10https://gerrit.wikimedia.org/r/358777 [18:49:28] anomie: I guess :] I am not actively looking at that irc channel :/ [18:51:02] (03CR) 10Cmjohnson: [C: 032] Adding mac address for analytics1069 T162216 [puppet] - 10https://gerrit.wikimedia.org/r/358777 (owner: 10Cmjohnson) [18:54:47] !log starting to delete all rows from linter tables on large wikis - T167758 [18:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:56] T167758: Need to empty "linter" database table on large wikis - https://phabricator.wikimedia.org/T167758 [18:56:24] !log reinstalling netmon1002 with stretch - scheduled icinga downtime [18:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:15] (03PS1) 10Ema: VCL: do not rate limit requests from IPs in wikimedia_trust [puppet] - 10https://gerrit.wikimedia.org/r/358779 (https://phabricator.wikimedia.org/T163233) [19:00:04] RainbowSprinkles: Respected human, time to deploy MediaWiki train 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170613T1900). Please do the needful. [19:00:53] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017): Communicate this security change to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3345721 (10Johan) Sorry, it seems like this wasn't picked up by anyone. @BBlack, do you want this to be communicated as so... [19:01:25] (03CR) 10BBlack: [C: 031] VCL: do not rate limit requests from IPs in wikimedia_trust [puppet] - 10https://gerrit.wikimedia.org/r/358779 (https://phabricator.wikimedia.org/T163233) (owner: 10Ema) [19:01:45] !log mobrovac@tin Started deploy [restbase/deploy@9a86d4c]: (no justification provided) [19:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:18] RECOVERY - Restbase root url on restbase2001 is OK: HTTP OK: HTTP/1.1 200 - 15560 bytes in 0.011 second response time [19:04:42] (03PS2) 10Dzahn: wikistats: fix group, cron for xml dump, no spam [puppet] - 10https://gerrit.wikimedia.org/r/358644 [19:06:42] (03CR) 10Chad: [C: 032] group0 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358641 (owner: 10Chad) [19:06:44] (03PS2) 10Ema: VCL: do not rate limit requests from IPs in wikimedia_trust [puppet] - 10https://gerrit.wikimedia.org/r/358779 (https://phabricator.wikimedia.org/T163233) [19:07:27] RECOVERY - Restbase root url on restbase2005 is OK: HTTP OK: HTTP/1.1 200 - 15560 bytes in 0.015 second response time [19:08:24] !log netmon1002 - reinstallled with stretch, revoked puppet cert, salt key, signing new cert, accepting new key, initial puppet run (T159756) [19:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:34] T159756: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756 [19:08:37] RECOVERY - Restbase root url on restbase1010 is OK: HTTP OK: HTTP/1.1 200 - 15560 bytes in 0.085 second response time [19:09:18] !log 
mobrovac@tin Finished deploy [restbase/deploy@9a86d4c]: (no justification provided) (duration: 07m 33s) [19:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:34] (03Merged) 10jenkins-bot: group0 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358641 (owner: 10Chad) [19:12:08] (03CR) 10jenkins-bot: group0 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358641 (owner: 10Chad) [19:12:19] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to wmf.5 [19:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:05] 10Operations, 10RESTBase, 10RESTBase-API, 10Traffic, 10Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3345748 (10GWicke) To summarize the options using a single domain only: ## Use www.wikimedia.org only ### Pros - Follows the common www. co... [19:13:18] (03CR) 10Ema: [C: 032] VCL: do not rate limit requests from IPs in wikimedia_trust [puppet] - 10https://gerrit.wikimedia.org/r/358779 (https://phabricator.wikimedia.org/T163233) (owner: 10Ema) [19:14:05] (03CR) 10Dzahn: [C: 032] wikistats: fix group, cron for xml dump, no spam [puppet] - 10https://gerrit.wikimedia.org/r/358644 (owner: 10Dzahn) [19:14:10] (03PS3) 10Dzahn: wikistats: fix group, cron for xml dump, no spam [puppet] - 10https://gerrit.wikimedia.org/r/358644 [19:14:38] (03PS2) 10Dzahn: replace references to RT tickets with Phab ticket numbers [puppet] - 10https://gerrit.wikimedia.org/r/358500 (https://phabricator.wikimedia.org/T165733) [19:15:07] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestneutron2002 - https://phabricator.wikimedia.org/T167160#3345754 (10RobH) [19:15:20] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestnet2002 - https://phabricator.wikimedia.org/T167159#3345755 (10RobH) 
a:05Papaul>03RobH [19:15:38] (03Abandoned) 10Dzahn: rancid: add APT pin to jessie-backports release [puppet] - 10https://gerrit.wikimedia.org/r/358645 (https://phabricator.wikimedia.org/T159756) (owner: 10Dzahn) [19:15:47] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestneutron2002 - https://phabricator.wikimedia.org/T167160#3319508 (10RobH) I had listed raid10 it seems in the setup, but its a two disk system, so fixed and installing. [19:16:18] Hey, Ops, there is a UBN! task https://phabricator.wikimedia.org/T167819#3345654 [19:16:27] it seems ores is mostly down [19:16:41] but it seems scb1001 is the main reason [19:16:43] (03CR) 10Dzahn: [C: 031] irc echo: Convert package from python-irclib to python-irc [puppet] - 10https://gerrit.wikimedia.org/r/358626 (owner: 10Paladox) [19:16:51] Amir1, want to try to roll back code just in case? [19:16:54] https://grafana.wikimedia.org/dashboard/db/ores?panelId=4&fullscreen&orgId=1&from=now-24h&to=now [19:16:55] It seems impossible but w/e [19:17:11] halfak: the errors started before we deploy [19:17:15] right [19:17:20] 10Operations, 10netops: set up a looking glass for WMF ASes - https://phabricator.wikimedia.org/T106056#3345763 (10ayounsi) After looking at the various looking glass, bird-lg seems indeed the best option (doesn't need ssh access to the routers, open-source, user-friendly, supports multiple regions). That's wh... [19:17:20] I think scb1001 is having hardware issues [19:17:45] akosiaris likes to give us a hard time when we don't roll back even when it doesn't seem to make sense to do so ;) [19:17:51] halfak: https://grafana.wikimedia.org/dashboard/db/ores?panelId=4&fullscreen&orgId=1&from=now-24h&to=now [19:18:20] click on scb1001 and compare it with other scb nodes in eqiad [19:18:57] scb1001 seems alive and doing things. and task was downgraded from UBN to High? 
[19:18:58] One thing that we can do for now is to switch traffic to codfw [19:19:09] mutante, I just brought it back up to UBN [19:19:10] and then back to UBN again [19:19:16] https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1 [19:19:30] it fails in 80% proper responses now [19:19:33] it shouldn't [19:19:39] mutante, if it's not clear, errors started before our deploy. No icigna AFAICT [19:19:49] We just happened to notice the issue while checking on our deploy. [19:19:55] mutante: I'm guessing it hit swap [19:20:13] the server seems alive and answering requests [19:20:24] causing it to become slow [19:20:45] https://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&h=scb1001.eqiad.wmnet&m=bytes_out&s=by+name&mc=2&g=mem_report&c=Service+Cluster+B+eqiad [19:20:54] there is a lot of "changeprop" user activity [19:21:32] hmm, strange [19:21:45] mutante: https://grafana.wikimedia.org/dashboard/db/ores?panelId=4&fullscreen&orgId=1&from=now-24h&to=now [19:21:58] click on scb1001 and compare it to other eqiad nodes, [19:21:59] FWIW, it seems that scb1001 is the only node that regularly 500s [19:22:06] 1600 UTC is the time it all started [19:23:20] pdfrendering is using a lot of CPU [19:23:35] mutante, yeah, that's been pegged since I started looking [19:24:37] !log demon@tin Synchronized php-1.30.0-wmf.5/extensions/CategoryTree/CategoryPageSubclass.php: Fix up variable visibility (duration: 00m 44s) [19:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:52] !log scb1001 - killed process 10971 (pdfrendering/electron) [19:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:10] I have confirmed that it is *not* wsgi on scb1001 because I can retrieve a cached score over and over again without a 500. 
[19:25:26] Damn I wish there was log output for this 500 [19:26:05] the cache hit rate is falling since 1600 https://grafana.wikimedia.org/dashboard/db/ores?panelId=12&fullscreen&orgId=1&from=now-24h&to=now [19:26:10] any specific thing you want me to look for? [19:26:48] mutante, any ideas for what we could do with the uwsgi ORES is running on for scb1001 in order to find out why it is returning a 500? [19:27:17] *intermittently and seemingly only when it actually needs to communicate with celery workers* [19:27:32] halfak: that sounds like a redis connection issue [19:27:43] /var/log/uwsgi/app is a thing but it's empty [19:27:44] We can read from the cache just fine [19:27:47] Different connection [19:27:57] /var/log/ores/app.log [19:28:21] /var/log/ores doesn't exist on scb1001 [19:28:25] woops [19:28:40] /srv/log/ores/app.log [19:28:43] * halfak pastes this time [19:29:01] ores.wsgi.util.ParamError: Could not interpret revids. invalid literal for int() with base 10: 'last' [19:29:12] 10Operations, 10ops-eqiad: rack/setup/install ores100[1-9] - https://phabricator.wikimedia.org/T167808#3345810 (10RobH) 05Open>03declined this task is a dupe of T165171 [19:29:14] that's an old error [19:29:21] File hasn't been written to in hours [19:29:29] someone is using our service incorrectly, that's fine [19:29:41] mutante: can you check if connections to redis nodes on scb1001 is fine comparing to others [19:29:47] 10Operations, 10netops: set up a looking glass for WMF ASes - https://phabricator.wikimedia.org/T106056#3345819 (10ayounsi) a:03ayounsi [19:29:55] /srv/log/ores/main.log shows current activity [19:29:56] hmm, HDD issue happened before to us too [19:30:00] they are 200s [19:30:02] mutante, right [19:30:28] mutante, try this: [19:30:30] $ wget '127.0.0.1:8081/v3/scores/ruwiki/126351?features' -O- [19:30:32] (03CR) 10Tjones: "Surprise!! cnwikipedia should be fine, too." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/350312 (https://phabricator.wikimedia.org/T163829) (owner: 10Tjones) [19:30:56] 10Operations, 10ops-eqiad: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3258960 (10RobH) a:05akosiaris>03Cmjohnson Confirming racking is likely correct, since that is how we just racked all the codfw ores systems as well, so 2 per row, none in the same rack, and one row will... [19:30:57] halfak: yes, that shows a 500 [19:31:03] intermittently! [19:31:14] no other eqiad scb nodes have any 500s at all [19:31:17] (03CR) 10Tjones: "*cnwikimedia —dagnabbit I keep doing that." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350312 (https://phabricator.wikimedia.org/T163829) (owner: 10Tjones) [19:31:19] lots of 500 in the log [19:31:30] [2017-06-13T19:30:26] [pid: 2134] 10.2.2.10 (-) {32 vars in 685 bytes} [Tue Jun 13 19:30:25 2017] GET /v2/scores/enwiki/?models=reverted%7Cdamaging%7Cgoodfaith&revids=785480198&precache=true&format=json => generated 2813 bytes in 88 msecs (HTTP/1.1 500) 6 headers in 228 bytes (1 switches on core 0) user agent "ChangePropagation/WMF" [19:32:22] yep, all confirmed. intermittently. one worked [19:33:05] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestneutron2002 - https://phabricator.wikimedia.org/T167160#3345834 (10RobH) [19:33:10] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestnet2002 - https://phabricator.wikimedia.org/T167159#3345835 (10RobH) [19:33:14] I wonder if that's related to what I just saw... [19:33:22] Oh? [19:33:26] 10Operations, 10netops: Faulty link between cr2-codfw and cr1-eqdfw - https://phabricator.wikimedia.org/T167261#3345837 (10ayounsi) Circuit identified and troubleshooting started by CyrusOne. [19:33:36] nvm. 
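The intermittent 500s above were confirmed by rerunning one request and by reading the uwsgi log. A minimal sketch of counting them from log lines instead, assuming each line carries a fragment shaped like the `(HTTP/1.1 500)` in the pasted line; `tally_statuses` and `error_fraction` are hypothetical helper names, not part of ORES or uwsgi:

```python
import re
from collections import Counter

# Matches the "(HTTP/1.1 500)" fragment seen in the pasted uwsgi log line.
STATUS_RE = re.compile(r"\(HTTP/\d\.\d (\d{3})\)")


def tally_statuses(lines):
    """Count HTTP status codes found in an iterable of log lines.

    Lines without a recognizable status fragment are ignored.
    """
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def error_fraction(counts):
    """Fraction of counted requests with a 5xx status (0.0 if nothing counted)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    errors = sum(n for status, n in counts.items() if 500 <= status <= 599)
    return errors / total
```

Feeding it an open log file, for example `tally_statuses(open(path))`, gives the 200/500 split without eyeballing individual requests.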
[19:33:50] damn [19:34:15] (03PS1) 10Smalyshev: Add "latest" links to TTL dumps [puppet] - 10https://gerrit.wikimedia.org/r/358783 (https://phabricator.wikimedia.org/T164783) [19:34:17] maybe we restart the ores/celery services on scb1001? [19:34:28] Could be that an OOM caused instability at some point? [19:35:18] halfak: on it [19:35:27] happy to try it. i don't see the typical signs of hardware errors [19:35:30] ok, please do it [19:36:10] 10Operations, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestneutron2002 - https://phabricator.wikimedia.org/T167160#3345845 (10RobH) a:05RobH>03chasemp Assigning to @chasemp for service implementation. I also removed the onsite project tags, since all onsite work is completed for this system. [19:36:13] 10Operations, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestnet2002 - https://phabricator.wikimedia.org/T167159#3345848 (10RobH) Assigning to @chasemp for service implementation. I also removed the onsite project tags, since all onsite work is completed for this system. [19:36:21] 10Operations, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestnet2002 - https://phabricator.wikimedia.org/T167159#3345850 (10RobH) a:05RobH>03chasemp [19:36:30] mutante, is there way we could check on the stability of the redis connection? [19:36:32] since it's only one node that having trouble. It is kinda weird to me [19:37:19] !log restarting ores-related services in scb1001 (T167819) [19:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:29] T167819: ORES in eqiad is unhappy - https://phabricator.wikimedia.org/T167819 [19:37:43] redis://:OMGPASSWORD@oresrdb.svc.eqiad.wmnet:6379 [19:38:07] ottomata or elukey: if I get puppet installed on an1069 will it break anything? 
this is the dropped server replacement [19:38:20] Amir1, looks like it's recovered [19:38:24] the median is rising to 1.7 minutes https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&panelId=15&fullscreen [19:38:24] yay [19:38:37] 20/20 200 responses [19:38:38] i started to look at redis-cli . but this is good :) [19:39:05] oh well, good old "have you tried restarting it" i guess [19:39:32] It started to decrease from some minutes ago https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1&from=now-1h&to=now [19:39:58] so i did one actual thing. killed that pdfrendering process [19:40:25] Hmm... I guess that could have been related. Wasn't using much memory [19:40:25] mutante: can you get exact time? [19:40:55] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack and set up analyics1058-1069 - https://phabricator.wikimedia.org/T162216#3345877 (10Cmjohnson) @elukey @ottomata analytics1069 is installed, I stopped short of getting puppet running. I wasn't sure if you already had a config for this and did not want to ch... [19:41:03] Amir1: 19:24 mutante: scb1001 - killed process 10971 (pdfrendering/electron) [19:41:12] 20 minutes ago [19:41:16] mutante: Yes, that's the reason [19:41:16] https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&panelId=3&fullscreen [19:41:28] look when ores in scb1001 is going to take more requests [19:41:31] heh, i was gonna ask "did it get any better" [19:42:25] mutante: it should be leashed [19:42:40] who is responsible for that? [19:42:59] for electron? hmm [19:44:02] finds https://phabricator.wikimedia.org/project/members/2162/ [19:45:59] we haven't moved to it yet https://phabricator.wikimedia.org/T146757 [19:47:05] https://phabricator.wikimedia.org/T143129 [19:47:37] this moved it to scb [19:48:03] finds a place to leave a comment [19:50:13] mutante: I added in https://phabricator.wikimedia.org/T167819 [19:50:16] do you think this should count as outage? [19:50:21] with a report [19:50:35] mutante & Amir1: yeah. 
I think this needs a report. [19:50:35] yeah definitely [19:50:39] It's going to be a weird report. [19:51:01] Something happened? We futzed around for a bit and eventually restarted uwsgi/celery and it went away. [19:51:01] halfak: I think it is still worth checking redis connections too [19:52:16] Amir1: commented also on https://phabricator.wikimedia.org/T143129#3345950 [19:53:05] mutante: great thanks [19:53:11] I'm calling it a day [19:53:23] Amir1, good sleep! I'll get the incident filed [19:53:29] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack and set up analyics1058-1069 - https://phabricator.wikimedia.org/T162216#3345958 (10Ottomata) Thanks @cmjohnson, we'll take it from here then! Appreciated! [19:53:42] Amir1: ok, np. good night [19:54:08] halfak: thanks! [19:55:20] is it ok to lower from UBN then? [19:55:34] mutante: it's okay to resolve it I guess [19:55:38] i'll help with the report [19:55:43] ok, cool [19:55:49] o/ [19:56:09] !log mobrovac@tin Started deploy [restbase/deploy@4c1cdd0]: (no justification provided) [19:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:21] PROBLEM - Restbase root url on restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused [19:58:33] known ^ [20:00:39] !log mobrovac@tin Finished deploy [restbase/deploy@4c1cdd0]: (no justification provided) (duration: 04m 29s) [20:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:54] mutante, is this something that could have been noticed earlier with icinga? E.g. maybe we should have a check for each node individually? [20:01:01] * halfak thinks about followup tasks. 
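The per-node check halfak asks about could look roughly like the following. This is only a sketch of the idea, not an Icinga plugin: `failing_nodes`, the 5% threshold, and the minimum sample size are all assumptions chosen for illustration.

```python
def failing_nodes(per_node, max_error_rate=0.05, min_requests=20):
    """Return the sorted names of nodes whose 5xx rate exceeds max_error_rate.

    per_node maps a node name to an (errors, total) pair of request counts.
    Nodes with fewer than min_requests samples are skipped so that a single
    failed request on a quiet node does not trigger an alert.
    """
    flagged = []
    for node, (errors, total) in per_node.items():
        if total < min_requests:
            continue
        if errors / total > max_error_rate:
            flagged.append(node)
    return sorted(flagged)
```

Evaluating each node separately matters here because an aggregate error rate across the cluster can stay under a threshold while one node fails most of its requests.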
[20:02:00] !log demon@tin Synchronized php-1.30.0-wmf.5/includes/api/ApiParse.php: T167826 (duration: 00m 44s) [20:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:09] T167826: ApiParse: Call to member function getModel() on non-object - https://phabricator.wikimedia.org/T167826 [20:02:17] (03PS1) 10Hashar: swift: save nscd CPU by using IP address [puppet] - 10https://gerrit.wikimedia.org/r/358799 (https://phabricator.wikimedia.org/T160990) [20:07:34] halfak: yea, probably. there is 5xx error detection where Icinga asks graphite for the error rate https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=5xx [20:08:27] halfak: unless there is an even better way to check when actual users see errors [20:11:15] mutante, cool I'll add those notes to a phab card. Thanks :) [20:11:18] the best monitoring would be if it tests something at a high level, behaves like a user [20:11:23] yw [20:14:02] 10Operations, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review, 10Wikimedia-Incident: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#3346114 (10Andrew) p:05High>03Normal right now I'm just checking periodically to see if there are new leaks. [20:17:46] 10Operations, 10ops-eqiad, 10Dumps-Generation, 10Patch-For-Review: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3346127 (10RobH) Ok, the error has now gone away. It may have been an odd condition where the raid was rebuilding, but not certain. Now the question is how to... [20:23:07] (03CR) 10Platonides: "I don't really like that "192.0.2.1" default value, chosen just because that is a private IP not used on wikimedia. I would go with "0.0.0" [puppet] - 10https://gerrit.wikimedia.org/r/358779 (https://phabricator.wikimedia.org/T163233) (owner: 10Ema) [20:37:24] !log Restarting Nodepool. 
apparently confused in pool tracking and spawning to many Trusty nodes (7 instead of 4) [20:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:28] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017): Communicate this security change to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3346178 (10Johan) And just to make sure I understand what's to be done: //The problem// IE8-on-XP will no longer be supp... [20:44:31] keyholder on jessie vs. stretch. same RSA private key, can be used fine on jessie. on stretch it says "is not an acceptable key. Is it an RSA or ED25519 key with passphrase", but yes, it is both, RSA and passphrase and same file that puppet installed [20:48:19] (03PS1) 10BBlack: VCL: raise ratelimit for RB, exclude labs from limits [puppet] - 10https://gerrit.wikimedia.org/r/358860 (https://phabricator.wikimedia.org/T163233) [20:50:14] (03CR) 10BBlack: [C: 032] VCL: raise ratelimit for RB, exclude labs from limits [puppet] - 10https://gerrit.wikimedia.org/r/358860 (https://phabricator.wikimedia.org/T163233) (owner: 10BBlack) [20:52:06] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack and set up analyics1058-1069 - https://phabricator.wikimedia.org/T162216#3346221 (10Ottomata) @cmjohnson, I was about to do this, but I need to know which row and rack it is in. I can see its in Row D, but which rack? [20:52:33] ah, i think the output of ssh-keygen changed [20:52:43] had to deal with this before [20:52:59] the "validity" check relies on ssh-keygen .. | grep .. [20:53:13] mutante: ewww. we should fix that [20:53:41] yes, i'm trying to do that [20:53:52] figuring out the exact change right now [20:54:24] * Platonides suspects that might originally be coded by him [20:54:53] striker uses https://pypi.python.org/pypi/sshpubkeys [20:55:22] PROBLEM - Keyholder SSH agent on netmon1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. 
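A check that greps `ssh-keygen` output breaks when the output format shifts between releases. One known difference is the fingerprint field of `ssh-keygen -l`: older OpenSSH prints colon-separated MD5 hex, newer OpenSSH prints a `SHA256:`-prefixed value. Whether that is the exact change that hit keyholder is an assumption; the log does not say. A sketch of a parser that ignores the fingerprint format and reads only the key size and the trailing key type (`parse_fingerprint_line` and `is_acceptable_type` are hypothetical names):

```python
import re

# "<bits> <fingerprint> [comment ...] (<TYPE>)". The comment may contain
# spaces, so only the first two fields and the trailing type are anchored.
LINE_RE = re.compile(r"^(\d+)\s+(\S+)\s+(?:.*\s)?\((\w+)\)\s*$")


def parse_fingerprint_line(line):
    """Parse one line of `ssh-keygen -l` output; return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    bits, fingerprint, key_type = match.groups()
    return {"bits": int(bits), "fingerprint": fingerprint, "type": key_type}


def is_acceptable_type(line, allowed=("RSA", "ED25519")):
    """True if the line parses and its key type is in the allowed set."""
    parsed = parse_fingerprint_line(line)
    return parsed is not None and parsed["type"] in allowed
```

Where possible, testing the command's exit status is sturdier still than parsing any of its human-readable output.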
[20:55:25] https://github.com/wikimedia/labs-striker/blob/master/striker/profile/utils.py [20:55:54] I guess this is probably private keys though in keyholder? [21:00:59] ottomata: d8 [21:02:21] 10Operations, 10ops-eqiad, 10Dumps-Generation, 10Patch-For-Review: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3346228 (10RobH) chatted with ariel via irc: sda = hardware raid 1 of 2 1TB disks for a 999.7GB disk sdb = hardware raid 10 of 12 4TB disks for a 24TB disk Th... [21:02:22] !log demon@tin Synchronized php-1.30.0-wmf.5/extensions/CentralAuth/includes/CentralAuthHooks.php: Fix bad method name (duration: 00m 44s) [21:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:21] bd808: yes, it's about loading private keys to arm the keyholder [21:04:25] (03PS1) 10RobH: dumpsdata100[12] new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/358880 [21:07:52] (03PS11) 10Mforns: Add white-list for EventLogging auto-purging [puppet] - 10https://gerrit.wikimedia.org/r/298721 (https://phabricator.wikimedia.org/T108850) [21:09:40] (03CR) 10RobH: [C: 032] dumpsdata100[12] new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/358880 (owner: 10RobH) [21:09:45] (03PS2) 10RobH: dumpsdata100[12] new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/358880 [21:10:01] !log ppchelko@tin Started deploy [changeprop/deploy@4ba3c59]: Rate-limiter enhancements [21:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:09] !log ppchelko@tin Finished deploy [changeprop/deploy@4ba3c59]: Rate-limiter enhancements (duration: 01m 08s) [21:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:30] !log Gracefully restarting Zuul [21:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:55] (03CR) 10BBlack: "The problem with 0.0.0.0 in the general case is that when we're matching ACLs sometimes 
implementations treat all-zeros specially as an AN" [puppet] - 10https://gerrit.wikimedia.org/r/358779 (https://phabricator.wikimedia.org/T163233) (owner: 10Ema) [21:16:47] (03CR) 10Legoktm: [C: 031] "Just to clarify, this affects all projects where the subdomain has a - in it? +1 as this affects quite a few projects: legoktm: yes, someone noted in T167492 that mobile redirect wasn't working for the zh-classical wiki, which prompted this [21:18:03] T167492: Accessing zh-classical.wikipedia.org on a mobile device does not redirect to zh-classical.m.wikipedia.org - https://phabricator.wikimedia.org/T167492 [21:18:24] legoktm: just wasn't sure if maybe that was intentional because these subdomains were special in a different way that didn't want mobile redirects [21:18:24] ok :) [21:18:33] I don't think so [21:29:35] !log mobrovac@tin Started deploy [restbase/deploy@9a86d4c]: (no justification provided) [21:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:40] !log mobrovac@tin Finished deploy [restbase/deploy@9a86d4c]: (no justification provided) (duration: 01m 06s) [21:30:41] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] [21:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:21] RECOVERY - Restbase root url on restbase2001 is OK: HTTP OK: HTTP/1.1 200 - 15560 bytes in 0.009 second response time [21:33:05] (03PS1) 10Odder: Upload logos for the Dinka Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358883 [21:34:12] I am manually purging the CI queue [21:34:19] it is holding all changes for now [21:37:11] (03PS1) 10Dzahn: keyholder: add stretch support, fix key validity check [puppet] - 10https://gerrit.wikimedia.org/r/358884 [21:38:20] (03CR) 10Paladox: [C: 031] "He showed me on irc the difference on jessie/stretch. 
Different error output's too." [puppet] - 10https://gerrit.wikimedia.org/r/358884 (owner: 10Dzahn) [21:39:54] hashar: something up? [21:40:08] chasemp: yeah overload of CI due to too many patchsets [21:40:13] and Zuul being screwed up due to a bug :( [21:40:13] ah [21:40:22] good news: it is nothing related to the cloud !!! [21:40:24] even more fun, ok [21:40:45] ha, well, not reveling in your misery but it's something [21:40:52] (03CR) 10Dzahn: "jessie:" [puppet] - 10https://gerrit.wikimedia.org/r/358884 (owner: 10Dzahn) [21:41:04] (03CR) 10Platonides: [C: 031] keyholder: add stretch support, fix key validity check [puppet] - 10https://gerrit.wikimedia.org/r/358884 (owner: 10Dzahn) [21:41:17] I made a mistake earlier trying to gracefully stop Zuul to clean up a dirty state [21:41:31] but it still needs to finish processing the jobs queue :( [21:41:50] (03PS2) 10Dzahn: keyholder: add stretch support, fix key validity check [puppet] - 10https://gerrit.wikimedia.org/r/358884 [21:45:22] RECOVERY - Keyholder SSH agent on netmon1002 is OK: OK: Keyholder is armed with all configured keys. [21:45:35] queue almost flushed [21:46:04] !log netmon1002 - was able to "keyholder arm" after stretch install after applying https://gerrit.wikimedia.org/r/358884 as hotfix [21:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:49] * paladox wonders what the zuul bug was? [21:47:09] paladox: I will write about it tomorrow I guess [21:47:18] ok. thanks. [21:52:11] PROBLEM - zuul_gearman_service on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused [21:52:18] and of course Zuul refuses to start :( [21:52:51] PROBLEM - zuul_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [21:53:33] :( [21:54:21] Was something upgraded recently? 
[21:54:34] https://en.wikisource.org/w/index.php?title=Page:Library_Construction,_Architecture,_Fittings,_and_Furniture.djvu/52&action=edit&redlink=1 doesn't work [21:54:39] properly [21:55:13] RECOVERY - zuul_gearman_service on contint1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 4730 [21:55:51] RECOVERY - zuul_service_running on contint1001 is OK: PROCS OK: 2 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [21:56:04] :( [21:56:16] !log Zuul back, running in an interactive terminal. [21:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:57] ShakespeareFan00: wikisources were updated last week [21:58:10] !log restarted pdfrender on scb1002 and scb1004; was spinning on CPU [21:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:33] !log restarted pdfrender on scb1003; was spinning on CPU & using 15G of memory (!) [21:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:41] PROBLEM - pdfrender on scb1003 is CRITICAL: connect to address 10.64.32.153 and port 5252: Connection refused [22:05:18] Okay somethings borken [22:05:20] *broken [22:05:32] https://en.wikisource.org/wiki/File:Library_Construction,_Architecture,_Fittings,_and_Furniture.djvu won't display [22:05:53] Or does so VERY slowly [22:08:58] Pretty quick for me [22:09:24] (03PS1) 10Legoktm: Deploy Linter to all wikis (try #2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358887 (https://phabricator.wikimedia.org/T148609) [22:09:37] Different pages have an initial lag because it's freshly uploaded and so needs all its thumbs generated. 
[22:09:41] But a refresh is fast [22:18:53] (03PS1) 10GWicke: Limit Electron memory usage to 2G [puppet] - 10https://gerrit.wikimedia.org/r/358888 (https://phabricator.wikimedia.org/T167834) [22:20:28] 10Operations, 10Labs: virbr0 interface present in some virt hosts - https://phabricator.wikimedia.org/T83732#3346434 (10chasemp) Post hoc note. I noticed that `/etc/libvirt/qemu/networks/autostart/default.xml` is ensured absent in our nova compute role. This is a file that libvirt seems to generate and the c... [22:27:06] (03PS1) 10RobH: tweaking dumpsdata partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/358889 [22:27:17] (03CR) 10jerkins-bot: [V: 04-1] tweaking dumpsdata partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/358889 (owner: 10RobH) [22:27:26] well... thats sad [22:27:32] (03PS2) 10RobH: tweaking dumpsdata partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/358889 [22:29:16] (03CR) 10RobH: [C: 032] tweaking dumpsdata partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/358889 (owner: 10RobH) [22:31:57] =P ci slowwwww [22:32:13] hurry up zuul i wanna test shit [22:32:18] (03PS3) 10Dzahn: keyholder: add stretch support, fix key validity check [puppet] - 10https://gerrit.wikimedia.org/r/358884 (https://phabricator.wikimedia.org/T159756) [22:32:37] 10Operations, 10netops: Merge AS14907 with AS43281 - https://phabricator.wikimedia.org/T167840#3346480 (10faidon) [22:33:28] 10Operations, 10Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3346494 (10Dzahn) Could not "keyholder arm" on stretch. Reason turns out to be the output of ssh-keygen changed, which keyholder relies on: 21:46 mutante: netmon1002 - was able to "keyholder arm" afte... 
[22:33:49] 10Operations, 10netops: Merge AS14907 with AS43281 - https://phabricator.wikimedia.org/T167840#3346497 (10faidon) [22:45:06] 10Operations, 10netops: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841#3346525 (10faidon) [22:47:24] 10Operations, 10netops: Find a new PIM RP IP - https://phabricator.wikimedia.org/T167842#3346542 (10faidon) [22:58:09] jouncebot: next [22:58:20] In 0 hour(s) and 1 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170613T2300) [22:59:23] Thanks jouncebot. [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170613T2300). Please do the needful. [23:00:04] andrewbogott: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:26] evening swat, aka "oh crap, it's already 4?!" [23:00:52] andrewbogott: I got ya handled, should be merged shortly [23:00:53] ahahah [23:01:11] greg-g: For me today it's "finally, it's 4." Today has been slowwwwwwwwww [23:02:11] PROBLEM - zuul_gearman_service on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused [23:02:51] PROBLEM - zuul_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [23:02:59] I have stopped zuul [23:03:10] RainbowSprinkles: thank you! [23:03:11] (03PS3) 10Dzahn: replace references to RT tickets with Phab ticket numbers [puppet] - 10https://gerrit.wikimedia.org/r/358500 (https://phabricator.wikimedia.org/T165733) [23:03:25] RainbowSprinkles: great, I'm here to test when it's deployed. 
[23:04:10] !log demon@tin Synchronized php-1.30.0-wmf.4/extensions/OpenStackManager: Re-adding deleted special page (duration: 00m 45s) [23:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:20] andrewbogott: You're live ^ [23:04:59] !log netmon1001/1002: rsynced /var/lib/rancid/CVS and /var/lib/rancid/GIT from 1001 to 1002 for rancid migration (T159756) [23:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:08] T159756: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756 [23:05:11] RainbowSprinkles: works! [23:05:15] Yay [23:05:54] anyone knows about systemd ? I can't start zuul anymore on contint1001.wikimedia.org [23:05:58] and I can't figure out why :( [23:06:47] hashar: taking a look [23:07:34] at least puppet has [23:07:34] zuul::server::service_ensure: running [23:07:35] zuul::server::service_enable: true [23:07:55] it says zuul is active [23:08:08] will try restarting it [23:08:08] yeah the unit [23:08:16] Active: active (exited) since Thu 2017-04-06 08:29:39 UTC; 2 months 7 days ago [23:08:43] was there a recent change to systemd [23:08:47] but it doesn't show up in ps -u zuul f [23:08:47] (03CR) 10Faidon Liambotis: [C: 04-2] "Per task." [puppet] - 10https://gerrit.wikimedia.org/r/358799 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [23:08:55] I dont think I changed anything [23:09:39] Worker .. starting reconnect for Gea.. [23:09:47] (03PS2) 10Chad: ExtensionDistributor: Add REL1_29, drop REL1_23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356486 (owner: 10Legoktm) [23:09:51] GarmanJobServerSession [23:10:30] File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 387, in _ssh [23:10:33] raise Exception("Gerrit error executing %s" % command) [23:10:36] Exception: Gerrit error executing gerrit review --project mediawiki/extensions/CentralAuth --message "Main test build failed. [23:10:39] ? 
[23:10:48] where is that ? [23:10:57] /var/log/zuul/error.log [23:11:06] ah yeah but it is old [23:11:21] just that it can not write a comment back to gerrit. Not so important [23:13:07] Active: active (running) since Tue 2017-06-13 23:12:58 UTC; 3s ago [23:13:09] mutante: it started! [23:13:11] RECOVERY - zuul_gearman_service on contint1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 4730 [23:13:18] !log contint1001 - started zuul using the old init script [23:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:30] Yay, init system wars! [23:13:38] root@contint1001:/etc/systemd# /etc/init.d/zuul restart [23:13:38] [ ok ] Restarting zuul (via systemctl): zuul.service. [23:13:38] root@contint1001:/etc/systemd# /etc/init.d/zuul status [23:13:38] ● zuul.service - LSB: Zuul [23:13:39] Loaded: loaded (/etc/init.d/zuul) [23:13:41] Active: active (running) since Tue 2017-06-13 23:12:58 UTC; 3s ago [23:13:43] I love when gerrit does that too ;-) [23:13:51] RECOVERY - zuul_service_running on contint1001 is OK: PROCS OK: 2 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [23:14:07] mutante: you are magi [23:14:08] c [23:14:20] https://phabricator.wikimedia.org/T167833 is the related task [23:14:43] I did a: VERBOSE=1 /etc/init.d/zuul start [23:15:58] I guess I will migrate zuul-server to be 100% behind systemd and drop the init script [23:16:32] don't use /etc/init.d/* anymore [23:16:39] is this .. a systemd unit file that uses the old init script as StartExec? [23:16:39] you weren't supposed to even pre-systemd [23:16:47] eeeeewwww [23:17:31] although from the above it sounds like it's just an init script? 
[23:18:10] yeah that's just systemd-sysv-generator [23:18:19] aka systemd's compatibility with sysv init scripts [23:18:27] ah, i see, yes [23:18:31] the deb package just ship a /etc/init.d/zuul + /etc/default/zuul [23:18:39] yeah [23:18:52] I guess it is time to phase that out [23:19:07] and migrate to a proper systemd service definition [23:19:15] native systemd units are always better, but an init script by itself shouldn't be an issue [23:19:31] proper init scripts, not gerrit's :) [23:19:46] by that I mean using /lib/lsb/init-functions etc. [23:20:24] i have a patch for that [23:20:32] https://gerrit-review.googlesource.com/#/c/108579/ [23:21:11] that's a bad patch [23:21:24] mutante: solved. Thank you very much :} [23:21:41] it's not just about including /lib/lsb/init-functions, it's also about using those functions [23:21:44] and also an LSB header [23:22:05] in any case, gerrit would be better off by shipping a systemd unit anyway [23:22:15] for zuul the init.d script comes from Precise era probably [23:22:22] paravoid: WIP ;-) [23:22:35] hashar: yw. and yes, let's convert it [23:22:41] RainbowSprinkles: :) [23:22:46] and moving to a systemd unit is straightforward (Daniel and I already did for the other daemon zuul-merger ) [23:22:50] mutante: definitely :-} [23:22:51] paravoid it does. [23:23:09] it introduces a systemd script in 2.13.* something which we are not running currently. [23:23:15] oh nice [23:23:17] but it's not the type of script we want [23:23:19] it uses sockets [23:23:30] socket activation, I suppose you mean? 
[23:23:50] https://github.com/GerritCodeReview/gerrit/blob/390e4bc46e344154c1628bc5682c7941e4a67b27/gerrit-pgm/src/main/resources/com/google/gerrit/pgm/init/gerrit.service [23:23:51] yep [23:24:03] https://github.com/GerritCodeReview/gerrit/blob/390e4bc46e344154c1628bc5682c7941e4a67b27/gerrit-pgm/src/main/resources/com/google/gerrit/pgm/init/gerrit.socket [23:24:47] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team (Kanban), Zuul: Migrate zuul-server behind systemd service - https://phabricator.wikimedia.org/T167845#3346626 (hashar) [23:24:52] not necessarily a bad idea [23:25:09] but gerrit 2.14 init.s the file (by that i mean it puts it in bin/ where the init file is.) [23:25:14] mutante: filled the task. Will try to work on that next week :-}  Thanks again for the life saver! [23:25:37] hashar: sounds good. np :) [23:25:38] there's a 2.14 too? :) [23:25:43] which version are we running? [23:25:57] 2.13.4 [23:26:10] sleeping for real & [23:26:10] 2.14 introduces a new ui (experimental) [23:26:16] 2.13.8 was next, wasnt it [23:26:21] yep [23:26:33] yeah I've seen polygerrit before [23:26:38] these guys don't know how to make UIs [23:26:46] even with fancy new web frameworks :) [23:26:48] lol [23:27:00] paravoid: I've said that for years.
[23:27:13] yeah, I remember you saying that on multiple occasions heh :) [23:28:23] you can file improvements for polygerrit here https://bugs.chromium.org/p/gerrit/issues/entry [23:29:12] !log gerrit: upgrading on master 2.13.4-13-gc0c5cc4742 -> 2.13.8-1-g7c438d37a2 (been running on slave for a week) [23:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:02] oh heh [23:30:24] Operations, ops-eqiad, Dumps-Generation, Patch-For-Review: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3346650 (RobH) [23:30:33] Operations, ops-eqiad, Dumps-Generation: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3259047 (RobH) [23:30:48] 18:42 mutante: built gerrit_2.13.8+git1-wmf.5 on copper (T158946) [23:30:48] T158946: Update gerrit to 2.13.8 - https://phabricator.wikimedia.org/T158946 [23:30:51] And yay, login error didn't come back on upgrade attempt #2 [23:31:04] haha proves them wrong, it's the cache [23:31:24] :) [23:32:41] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [23:33:07] it feels faster indeed [23:33:16] re: performance improvement on gerrit [23:33:22] Yeh [23:33:32] they did a couple improvements due to it using h2. [23:34:03] aren't we proxying it with apache at the front? [23:34:13] Yep, it sits behind apache as a reverse proxy [23:34:23] yeah, so h2 makes no difference [23:34:37] I don't think you mean the same h2? [23:34:49] probably not, what did you mean? :) [23:34:52] h2 is a db gerrit uses to store reviewers.
[23:34:57] ah [23:35:00] The embedded database format [23:35:06] https://en.wikipedia.org/wiki/H2_(DBMS) [23:35:43] I meant HTTP/2.0, H2 is an often used abbreviation for that [23:35:56] That's what I thought you meant and yeah you're totally right [23:36:00] h2 or h2c, it's even used in the protocols :) [23:36:30] (PS1) RobH: typo in dumpsdata1002 reverse file [dns] - https://gerrit.wikimedia.org/r/358895 [23:37:27] (CR) Dzahn: [C: 1] Limit Electron memory usage to 2G [puppet] - https://gerrit.wikimedia.org/r/358888 (https://phabricator.wikimedia.org/T167834) (owner: GWicke) [23:37:49] (CR) RobH: [C: 2] typo in dumpsdata1002 reverse file [dns] - https://gerrit.wikimedia.org/r/358895 (owner: RobH) [23:37:53] (PS2) RobH: typo in dumpsdata1002 reverse file [dns] - https://gerrit.wikimedia.org/r/358895 [23:38:04] so why are we using H2 instead of e.g. MySQL? [23:38:14] Thats a good question. [23:38:20] Answer is you can now with 2.13.8 [23:38:35] though some functions still need h2. But reviewers can be migrated to mysql. [23:40:37] by setting this config https://gerrit-review.googlesource.com/#/c/103373/23/Documentation/config-gerrit.txt [23:42:04] I tested that but got muddled up. You have to have a seperate db for this. You carn't re use reviewdb (gerrit db name). [23:42:12] otherwise you will get 500 errors. [23:43:13] (CR) Dzahn: [C: 2] "only changes comment lines with ticket references" [puppet] - https://gerrit.wikimedia.org/r/358500 (https://phabricator.wikimedia.org/T165733) (owner: Dzahn) [23:46:29] "Notedb is the successor to ReviewDb: a replacement database backend for Gerrit. The goal is to read code review metadata from the same set of repositories that store the data. " [23:46:46] yep bad idea [23:46:48] it's slowwwww [23:46:55] heh [23:47:21] so they are moving stuff from H2 to MySQL but also stop using MySQL for reviewdb ? [23:47:38] No, none of this is an accurate history.
[23:48:20] RainbowSprinkles they told me that one of the reason why they coulden't add support for mariadb was because reviewdb is deprecated. [23:48:30] Yes, that one bit is correct. [23:48:44] But you've been able to use H2 as a DB backend forever--that's not new. We use mysql though for obvious reasons. H2 is *still* used in that case for on-disk caching. [23:48:51] ah i see, mutante nope there not migrating h2 to mysql. [23:49:11] The optimization was for H2 on-disk stuff. Which...helps a bit for us, but not a ton (mysql latency would be a bigger problem) [23:49:29] And yes, they're trying to kill the reviewdb (which could be mysql, postgres, h2, you name it) [23:49:44] (a design decision I'm rather iffy about, but they've been going down this road for years) [23:49:59] Operations, ops-eqiad, Dumps-Generation, Patch-For-Review: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3346678 (RobH) So there was a typo in the reverse entry for dumpsdata1002, and it caused it not to detect from the entry in netboot.cfg and use the auto parti... [23:50:01] RainbowSprinkles i guess that's why they are calling gerrit 3.0 sometimes. ReviewDB removal pending. [23:51:03] Too far in the future. I'm not even thinking about 2.14 yet! :) [23:51:17] 2.14 was meant to be 3.0 [23:53:45] (CR) Aaron Schulz: [C: 2] Capture messages on 'autoloader' debug log channel [mediawiki-config] - https://gerrit.wikimedia.org/r/356841 (https://phabricator.wikimedia.org/T166759) (owner: Ori.livneh) [23:54:49] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team (Backlog): Make it possible to run the mediawiki testsuite against a staging repo of apt.wikimedia.org - https://phabricator.wikimedia.org/T157038#3346689 (greg) [23:55:05] PROBLEM - Check systemd state on dumpsdata1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:55:52] Operations, Deployment-Systems, Beta-Cluster-reproducible, HHVM, and 2 others: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#3346691 (greg) [23:59:47] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/358500 (https://phabricator.wikimedia.org/T165733) (owner: Dzahn)