[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160428T0000). [00:06:18] Krenair: Yay! Yes [00:07:42] Please bear with me while I undo l10n-bot's commit that broke CI in Echo [00:08:04] I won't be able to merge my SWAT patches otherwise [00:08:44] 06Operations, 10Traffic: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#2245593 (10BBlack) [00:09:54] 06Operations, 10Traffic, 13Patch-For-Review: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#1970940 (10BBlack) [00:09:57] 06Operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 3 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1955365 (10BBlack) [00:10:01] 06Operations, 10Traffic, 07Varnish: Install XKey vmod - https://phabricator.wikimedia.org/T122881#1916107 (10BBlack) [00:10:04] 06Operations, 10Traffic: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#2245614 (10BBlack) [00:10:24] 06Operations, 10Traffic: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#2245593 (10BBlack) 05Open>03stalled p:05Triage>03High [00:10:37] 06Operations, 10Traffic: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#2245593 (10BBlack) [00:10:40] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: upload-lb.ulsfo.wikimedia.org still allow access to some deleted files - https://phabricator.wikimedia.org/T133819#2245416 (10BBlack) [00:10:46] 06Operations, 10Traffic: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#2245593 (10BBlack) [00:10:49] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: Deleted files sometimes remain visible to non-privileged users if permanently linked - https://phabricator.wikimedia.org/T109331#2245623 (10BBlack) [00:11:01] 
06Operations, 10Traffic: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#2245593 (10BBlack) [00:11:07] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#2245625 (10BBlack) [00:11:22] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [00:13:00] TimStarling: your change above didn't make puppet-merge to strontium (due to a bug), fixed now. puppeting carbon again JIC. [00:13:12] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [00:16:53] thanks bblack [00:20:45] 06Operations, 10Traffic: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#2245639 (10BBlack) [00:22:06] (03PS8) 10Ottomata: Add new confluent module and puppetization for using confluent Kafka [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) [00:22:23] (03PS1) 10Andrew Bogott: Increase the filehandle limit for rabbitmq in labs. [puppet] - 10https://gerrit.wikimedia.org/r/285888 [00:24:35] (03PS2) 10Andrew Bogott: Increase the filehandle limit for rabbitmq in labs. [puppet] - 10https://gerrit.wikimedia.org/r/285888 [00:28:42] (03CR) 10jenkins-bot: [V: 04-1] Add new confluent module and puppetization for using confluent Kafka [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) (owner: 10Ottomata) [00:29:16] (03CR) 10jenkins-bot: [V: 04-1] Increase the filehandle limit for rabbitmq in labs. [puppet] - 10https://gerrit.wikimedia.org/r/285888 (owner: 10Andrew Bogott) [00:29:56] !log Preparing to take phabricator offline for maintenance. 
[00:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:30:44] (re: https://phabricator.wikimedia.org/T128009) [00:34:13] grr [00:35:18] hm, 3b0a5eb1b6d164e4d34f3501d2c3b73256e4147c broke testing for all future patches [00:40:30] lovely. Phabricator db schema update is SLOW. This maintenance window might take a while, sorry everyone [00:40:39] (03PS3) 10Andrew Bogott: Increase the filehandle limit for rabbitmq in labs. [puppet] - 10https://gerrit.wikimedia.org/r/285888 [00:40:41] (03PS1) 10Andrew Bogott: Reformat a file permission from 01775 to 1775 [puppet] - 10https://gerrit.wikimedia.org/r/285891 [00:41:09] it's json-decoding, updating, then re-encoding every workboard column transaction. that's a lot of transactions. [00:41:42] twentyafterfour: That's the problem we have, we don't have actually a phabricator reserve instance, where someone can create a task, that phabricator is down, where he can 4SCREAM0 and make 5 PANIC 0 :D [00:42:14] Luke081515 i think that will be possible now [00:42:33] everybody loves panic :D [00:42:36] Since this update includes some support for doing that [00:42:44] RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:43:14] In general it's not bad, but it's interesting, that there are people everytime how cry, when it's sheduled maintenance [00:43:20] yeah we will have a backup instance eventually [00:46:37] (03CR) 10Dzahn: [C: 031] "yea, he meant to set the sticky bit "< TimStarling> it could have the sticky bit so that other users can't delete files that you upload"" [puppet] - 10https://gerrit.wikimedia.org/r/285891 (owner: 10Andrew Bogott) [00:47:42] aaaand we're back [00:48:10] (03CR) 10Andrew Bogott: [C: 032] Reformat a file permission from 01775 to 1775 [puppet] - 10https://gerrit.wikimedia.org/r/285891 (owner: 10Andrew Bogott) [00:48:49] !log Phabricator's back online, everything seems to have gone smoothly. 
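Note on the 01775 vs 1775 patch reviewed above (gerrit 285891): both spellings denote the same octal value, and the leading 1 is the sticky bit mentioned in the review comment. The sketch below checks this with Python's standard stat module. It assumes the path is a directory, which the log does not state, and the variable names are illustrative only.

```python
import stat

# The two spellings of the mode discussed in the patch.
OLD_MODE = "01775"
NEW_MODE = "1775"

# Parse both as octal; a leading zero does not change the value.
old_bits = int(OLD_MODE, 8)
new_bits = int(NEW_MODE, 8)
same_value = old_bits == new_bits

# The leading 1 is the sticky bit (S_ISVTX).
sticky_set = bool(new_bits & stat.S_ISVTX)

# Render the mode as `ls -l` would, assuming (hypothetically) a directory.
listing = stat.filemode(stat.S_IFDIR | new_bits)

print(same_value, sticky_set, listing)
```

With the sticky bit set on a shared directory, users can create files in it but cannot delete files owned by others, which matches the intent quoted in the review.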
[00:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:55:24] twentyafterfour: Hah, https://phabricator.wikimedia.org/T85184#2245691 is interesting [00:55:28] That's an unmerged commit from Gerrit [00:55:55] Also holy crap the layout changed [00:56:33] yep [00:56:35] the whole layout [00:56:40] RoanKattouw Yep, done in https://phabricator.wikimedia.org/D217 [00:57:20] RoanKattouw: unmerged commits from gerrit are being imported as we speak. This will allow us to finally kill gitblit with fire. Lots of fire. [00:58:01] And regarding the new layout - meh. I don't really like it and I'm pretty sure a lot of people will be less than pleased by critical details being moved from the top down to the side. [00:58:07] RoanKattouw: You will now also be able to view open changes on github. [00:58:20] but I tried to convince upstream that it was a bad idea, so did a few other people. They remain unconvinced [00:58:53] Yeah, I don't know [00:59:05] It's a bit jarring at first being used to the old layout, but now the task description is front and center [00:59:18] I do have to scroll down past the description to find blocked/blocking tasks now [00:59:31] But I think this layout might be clearer overall [01:00:29] 06Operations, 10Internet-Archive, 10Wikimedia-Planet, 07Upstream: wordpress.com seems to have blocked us from fetching feeds - https://phabricator.wikimedia.org/T133818#2245705 (10Peachey88) [01:00:38] yeah it's definitely nicer to look at but now you don't get all the critical info in one place. [01:03:18] 06Operations, 10Traffic: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#2245711 (10BBlack) I perhaps should've noted this in the description, but we also attempted one partial general improvement and reverted it. The improvement was to move from the singular multicast address we hav... 
[01:04:24] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 413.14 seconds [01:08:11] "This will allow us to finally kill gitblit with fire. Lots of fire." ? [01:08:58] :) [01:11:03] twentyafterfour: viewing raw php files works now. [01:11:04] :) [01:14:55] Re imports: that's good, but there's gonna be quite a lot of notification spam [01:15:08] Yep [01:17:50] yeah ... I should have disabled herald for the import [01:23:45] 06Operations, 13Patch-For-Review: decom magnesium (was: Reinstall magnesium with jessie) - https://phabricator.wikimedia.org/T123713#2245777 (10Dzahn) a:03Dzahn [01:24:08] twentyafterfour: Im not sure if you want to enable this phabricator.serious-business [01:24:16] (03PS1) 10Dzahn: RT: add role on ununpentium, disable acme, add IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/285894 (https://phabricator.wikimedia.org/T123713) [01:24:32] config/group/core/ [01:24:49] in phabricator.wikimedia.org/config/group/core/ [01:26:41] paladox: ? [01:26:54] I'm occasionally getting Aphront errors at phabricator.wikimeida.org. [01:26:59] wikimedia.org [01:27:14] Says this Allows you to remove levity and jokes from the UI. [01:27:34] Going to https://phabricator.wikimedia.org/config/ [01:27:35] paladox: it's already enabled [01:27:43] Show Peace out [01:28:09] Leah: known ... [01:28:19] Great. [01:28:26] currently importing a lot of changes into phabricator, it's slightly overloading the database [01:28:32] Me too [01:28:33] Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #2003: Can't connect to MySQL server on 'm3-master.eqiad.wmnet' (99). 
[01:28:44] Ah I see [01:29:11] I guess I can cut back on the job queue workers, to unload the db, but that will slow the import [01:29:45] 06Operations, 06Labs, 10Labs-Infrastructure, 10netops, 13Patch-For-Review: Allocate labs subnet in dallas - https://phabricator.wikimedia.org/T115491#2245799 (10chasemp) [01:29:48] 06Operations, 06Labs, 10Labs-Infrastructure, 10netops, 13Patch-For-Review: Allocate labs subnet in dallas - https://phabricator.wikimedia.org/T115491#1725570 (10chasemp) [01:29:52] 06Operations, 06Labs, 10Labs-Infrastructure, 10netops, 13Patch-For-Review: Allocate subnet for labs test cluster instances - https://phabricator.wikimedia.org/T115492#2245801 (10chasemp) [01:29:56] 06Operations, 06Labs, 10Labs-Infrastructure, 10netops, 13Patch-For-Review: Allocate subnet for labs test cluster instances - https://phabricator.wikimedia.org/T115492#1725583 (10chasemp) [01:30:02] 06Operations, 10Phabricator, 06Release-Engineering-Team, 10Traffic, 13Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#2245804 (10chasemp) [01:30:08] 06Operations, 10Phabricator, 06Release-Engineering-Team, 10Traffic, 13Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1528619 (10chasemp) [01:30:12] 06Operations, 10ops-codfw, 13Patch-For-Review: rack & initial setup of elastic2001-2024 - https://phabricator.wikimedia.org/T111080#2245807 (10chasemp) [01:30:28] (03PS2) 10Dzahn: RT: add role on ununpentium, disable acme, add IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/285894 (https://phabricator.wikimedia.org/T123713) [01:31:24] reduced from 5 to 3 taskmaster daemons, this should lighten the load on the database. 
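Note on the taskmaster reduction above: cutting the worker count caps how many import jobs can hold a database connection at once, at the cost of a slower import. The sketch below illustrates that trade-off with a generic thread pool. It is not Phabricator's daemon code; the function name, the fake task, and the sleep standing in for database work are all hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_connections(workers, jobs=20):
    """Run `jobs` fake import tasks on `workers` threads and
    return the highest number of tasks active at the same time."""
    lock = threading.Lock()
    active = 0
    peak = 0

    def fake_import_task(_job_id):
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # stand-in for time spent holding a DB connection
        with lock:
            active -= 1

    # The pool size is the hard upper bound on concurrent tasks.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_import_task, range(jobs)))
    return peak


if __name__ == "__main__":
    for workers in (5, 3):
        print(workers, "workers -> peak concurrency", peak_connections(workers))
```

The peak never exceeds the pool size, so going from 5 to 3 workers lowers the ceiling on simultaneous connections while the same 20 jobs take more rounds to finish.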
[01:34:14] (03PS3) 10Dzahn: RT: add role on ununpentium, disable acme, add IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/285894 (https://phabricator.wikimedia.org/T123713) [01:43:21] (03CR) 10Dzahn: [C: 032] RT: add role on ununpentium, disable acme, add IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/285894 (https://phabricator.wikimedia.org/T123713) (owner: 10Dzahn) [01:48:16] twentyafterfour: It is also adding commits to the task which is good. [01:48:37] (03PS1) 10Dzahn: RT: fix role class name that has changed lately [puppet] - 10https://gerrit.wikimedia.org/r/285895 [01:49:28] PROBLEM - puppet last run on ununpentium is CRITICAL: CRITICAL: puppet fail [01:50:32] (03CR) 10Dzahn: [C: 032] RT: fix role class name that has changed lately [puppet] - 10https://gerrit.wikimedia.org/r/285895 (owner: 10Dzahn) [01:54:45] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2245884 (10Dzahn) [01:54:47] 06Operations, 13Patch-For-Review: decom magnesium (was: Reinstall magnesium with jessie) - https://phabricator.wikimedia.org/T123713#2245883 (10Dzahn) 05stalled>03Open [01:56:07] PROBLEM - DPKG on ununpentium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [01:56:22] well, that wouldn't be the case if puppet could just continue please [01:56:29] Notice: /Stage[main]/Packages::Links/Package[links]/ensure: ensure changed 'purged' to 'present' [01:56:33] and then it sits there [01:57:54] working on a gigantic dpkg --unpack --auto-deconfigure ... [02:01:49] all the Perl goodies.. 
you can do it [02:03:47] RECOVERY - DPKG on ununpentium is OK: All packages OK [02:08:17] RECOVERY - puppet last run on ununpentium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:12:09] !log manually edited crontab on iridium and killed multiple instances of public_task_dump.py (the cronjob was defined as * 2 * * * instead of 0 2 * * *) [02:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:12:32] * twentyafterfour doesn't know wtf caused the crontab to get set to once per minute [02:13:57] twentyafterfour, there's a problem [02:14:21] https://phabricator.wikimedia.org/T115048#2245935 We didn't just add these commits [02:14:27] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [02:15:08] PROBLEM - puppet last run on lvs1012 is CRITICAL: CRITICAL: Puppet has 3 failures [02:15:37] icinga is whining because a new host already has services but it doesnt know the host yet... should shut up after next run [02:16:26] PROBLEM - puppet last run on mw1146 is CRITICAL: CRITICAL: Puppet has 47 failures [02:16:43] inbox just went up 15-20 unread :/ [02:17:36] same here ;-; [02:19:29] Krenair: revi: those are old commits getting backfilled [02:19:38] with today's timestamps? 
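Note on the crontab fix logged at 02:12:09 above: the only difference between `* 2 * * *` and `0 2 * * *` is the minute field, but it changes the job from once a day to once a minute for the whole 02:00 hour. The sketch below counts firings per day for both schedules. It is a simplified illustration, not a full cron parser: it only understands `*`, single numbers, ranges and comma lists, ignores the day and month fields, and its function names are hypothetical.

```python
def expand_field(field, lo, hi):
    """Expand one cron field into the set of values it matches.
    Supports '*', single numbers, 'a-b' ranges and comma lists only."""
    values = set()
    for part in field.split(","):
        if part == "*":
            values.update(range(lo, hi + 1))
        elif "-" in part:
            start, end = part.split("-")
            values.update(range(int(start), int(end) + 1))
        else:
            values.add(int(part))
    return values


def runs_per_day(schedule):
    """Count firings on a day that matches the day/month fields."""
    minute, hour, _dom, _month, _dow = schedule.split()
    minutes = expand_field(minute, 0, 59)
    hours = expand_field(hour, 0, 23)
    return len(minutes) * len(hours)


if __name__ == "__main__":
    for schedule in ("* 2 * * *", "0 2 * * *"):
        print(schedule, "->", runs_per_day(schedule), "run(s) per day")
```

The first schedule yields 60 runs and the second yields 1, which lines up with the concern in the log about dozens of overlapping copies of public_task_dump.py.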
[02:20:01] the action of adding them gets the timestamp when phabricator processes the action [02:20:12] nothing I can do about that now [02:22:58] (03PS1) 10Dzahn: RT: include standard in ununpentium [puppet] - 10https://gerrit.wikimedia.org/r/285896 (https://phabricator.wikimedia.org/T123713) [02:23:54] (03CR) 10Dzahn: [C: 032] RT: include standard in ununpentium [puppet] - 10https://gerrit.wikimedia.org/r/285896 (https://phabricator.wikimedia.org/T123713) (owner: 10Dzahn) [02:24:06] (03PS1) 1020after4: public_task_dump.py should run once per day, not once per minute [puppet] - 10https://gerrit.wikimedia.org/r/285897 [02:24:46] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.21) (duration: 10m 38s) [02:24:48] can an opsen look at ^ so that puppet doesn't somehow decide to reset the cron job back to minutely? I don't think iridium can handle 60 concurrent copies of that massive dump script [02:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:24:57] https://gerrit.wikimedia.org/r/285897 [02:25:24] (03PS2) 10BBlack: public_task_dump.py should run once per day, not once per minute [puppet] - 10https://gerrit.wikimedia.org/r/285897 (owner: 1020after4) [02:25:39] (03CR) 10BBlack: [C: 032 V: 032] public_task_dump.py should run once per day, not once per minute [puppet] - 10https://gerrit.wikimedia.org/r/285897 (owner: 1020after4) [02:26:28] bblack: thank you! [02:26:50] An important note: the Cron type will not reset parameters that are removed from a manifest. For example, removing a minute => 10 parameter will not reset the minute component of the associated cronjob to *. These changes must be expressed by setting the parameter to minute => absent because Puppet only manages parameters that are out of sync with manifest entries. [02:26:51] np [02:26:56] ^ could have been * before [02:27:04] but then got removed .. 
shrug [02:27:29] mutante: it just started running minutely today, I'm pretty sure I would have noticed it before if it ran 60 copies of that script [02:27:53] (like it started doing between 02:00 and 02:10, before I put a stop to it) [02:28:02] and there haven't been any changes to that cron definition in ages [02:28:11] (in puppet manifest anyway) [02:28:13] * twentyafterfour is puzzled [02:28:50] hmmm.. upgrades of the server software? [02:29:00] distro i mean [02:29:33] it doesnt actually say what the default value for "minute" is [02:29:38] just that it's optional [02:29:46] you could assume * or 0 [02:31:55] yeah I guess that could be it [02:32:08] I have no idea why it wouldn't have a default if it's optional. That's kinda insane [02:32:11] puppet passes it on to the cron provider [02:32:18] and that could have changed [02:32:24] it's weird [02:36:16] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [02:38:56] PROBLEM - HTTPS on ununpentium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [02:41:37] PROBLEM - Apache HTTP on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:41:44] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.22) (duration: 09m 24s) [02:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:42:37] PROBLEM - HHVM rendering on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:42:47] (03PS3) 10Andrew Bogott: Mark off a block of public IPs for labtest [dns] - 10https://gerrit.wikimedia.org/r/284491 (https://phabricator.wikimedia.org/T115491) [02:45:38] RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [02:53:51] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2246162 (10Chmarkine) >>! 
In T132450#2244769, @faidon wrote: > I'm not convinced https for that is a good idea. apt doesn't support it by default — apt-trans... [03:03:28] !log catrope@tin Synchronized php-1.27.0-wmf.21/extensions/Echo: Fix T133817 (originally scheduled for SWAT) (duration: 00m 39s) [03:03:29] T133817: Notifications hangs at nowiki - https://phabricator.wikimedia.org/T133817 [03:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:04:02] !log catrope@tin Synchronized php-1.27.0-wmf.22/extensions/Echo: Fix T133817 (originally scheduled for SWAT) (duration: 00m 34s) [03:04:03] T133817: Notifications hangs at nowiki - https://phabricator.wikimedia.org/T133817 [03:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:12:46] 06Operations, 10Phabricator, 06Release-Engineering-Team, 10Traffic, 13Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#2246257 (10chasemp) [03:12:52] 06Operations, 10Phabricator, 06Release-Engineering-Team, 10Traffic, 13Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1529683 (10chasemp) [03:12:58] 06Operations, 10Phabricator, 06Release-Engineering-Team, 10Traffic, 13Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1529683 (10chasemp) [03:13:04] 06Operations, 10Phabricator, 06Release-Engineering-Team, 10Traffic, 13Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1529708 (10chasemp) [03:13:18] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review, 07WorkType-Maintenance: Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. 
- https://phabricator.wikimedia.org/T109279#1584314 (10chasemp) [03:13:22] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review, 07WorkType-Maintenance: Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109279#2246263 (10chasemp) [03:13:28] 06Operations, 10Phabricator, 06Release-Engineering-Team, 10Traffic, 13Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1620392 (10chasemp) [03:13:34] 06Operations, 10Phabricator, 06Release-Engineering-Team, 10Traffic, 13Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1620392 (10chasemp) [03:13:50] 06Operations, 10Phabricator, 06Release-Engineering-Team, 10Traffic, 13Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1645325 (10chasemp) [03:13:59] 06Operations, 10ops-eqiad, 13Patch-For-Review: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. 
- https://phabricator.wikimedia.org/T112559#2246273 (10chasemp) [03:14:03] 06Operations, 06Discovery, 05codfw-rollout, 03codfw-rollout-Jul-Sep-2015: Set up a CirrusSearch cluster in codfw (Dallas, Texas) - https://phabricator.wikimedia.org/T105703#2246276 (10chasemp) [03:14:09] 06Operations, 06Discovery, 05codfw-rollout, 03codfw-rollout-Jul-Sep-2015: Set up a CirrusSearch cluster in codfw (Dallas, Texas) - https://phabricator.wikimedia.org/T105703#1449703 (10chasemp) [03:14:28] 06Operations, 06Discovery, 05codfw-rollout, 03codfw-rollout-Jul-Sep-2015: Set up a CirrusSearch cluster in codfw (Dallas, Texas) - https://phabricator.wikimedia.org/T105703#1449703 (10chasemp) [03:14:30] 06Operations, 06Discovery, 05codfw-rollout, 03codfw-rollout-Jul-Sep-2015: Set up a CirrusSearch cluster in codfw (Dallas, Texas) - https://phabricator.wikimedia.org/T105703#1449703 (10chasemp) [03:15:01] 06Operations, 06Labs, 10wikitech.wikimedia.org: intermittent nutcracker failures - https://phabricator.wikimedia.org/T105131#2246288 (10chasemp) [03:16:28] 06Operations, 10Monitoring: Collect and report nutcracker statistics to Ganglia and/or Graphite - https://phabricator.wikimedia.org/T107381#1493632 (10chasemp) [03:16:30] 06Operations, 10Monitoring: Collect and report nutcracker statistics to Ganglia and/or Graphite - https://phabricator.wikimedia.org/T107381#2246305 (10chasemp) [03:18:25] 06Operations, 05Continuous-Integration-Scaling, 13Patch-For-Review: Remove hashar and dduvall root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#2246309 (10chasemp) [03:19:30] 06Operations: pc100[123] maintenance and upgrade - https://phabricator.wikimedia.org/T100301#2246310 (10jcrespo) [03:19:33] 06Operations: pc100[123] maintenance and upgrade - https://phabricator.wikimedia.org/T100301#1309978 (10jcrespo) [03:19:34] 06Operations: pc100[123] maintenance and upgrade - https://phabricator.wikimedia.org/T100301#1309978 (10jcrespo) [03:19:37] 06Operations: 
pc100[123] maintenance and upgrade - https://phabricator.wikimedia.org/T100301#1309978 (10jcrespo) [03:19:50] 07Blocked-on-Operations, 03Discovery-Analysis-Sprint, 13Patch-For-Review: Create rsync connector to fluorine - https://phabricator.wikimedia.org/T98383#2246314 (10Ironholds) [03:19:53] 07Blocked-on-Operations, 03Discovery-Analysis-Sprint, 13Patch-For-Review: Create rsync connector to fluorine - https://phabricator.wikimedia.org/T98383#1266051 (10Ironholds) [03:20:02] 07Blocked-on-Operations, 03Discovery-Analysis-Sprint, 13Patch-For-Review: Create rsync connector to fluorine - https://phabricator.wikimedia.org/T98383#1266051 (10Ironholds) [03:20:21] 06Operations, 06Release-Engineering-Team, 13Patch-For-Review: Move sudo permissions for deployment from modules/mediawiki/manifests/users.pp to data.yaml - https://phabricator.wikimedia.org/T97678#2246320 (10chasemp) [03:20:51] 06Operations, 10Ops-Access-Requests, 10Phabricator, 06Release-Engineering-Team: Change twentyafterfour and demon to root on phabricator (iridium) - https://phabricator.wikimedia.org/T96425#1216676 (10chasemp) [03:20:56] 06Operations, 10Ops-Access-Requests, 10Phabricator, 06Release-Engineering-Team: Change twentyafterfour and demon to root on phabricator (iridium) - https://phabricator.wikimedia.org/T96425#2246324 (10chasemp) [03:21:01] 06Operations, 10Ops-Access-Requests, 10Phabricator, 06Release-Engineering-Team: Change twentyafterfour and demon to root on phabricator (iridium) - https://phabricator.wikimedia.org/T96425#2246327 (10chasemp) [03:22:32] 06Operations, 10Ops-Access-Requests, 10Phabricator, 06Release-Engineering-Team, 13Patch-For-Review: Chad H. 
needs access to iridium (Phabricator host) to manage repos - https://phabricator.wikimedia.org/T92564#2246334 (10chasemp) [03:22:59] 06Operations, 10Ops-Access-Requests, 05Security: define in Puppet or remove user account - tnegrin - https://phabricator.wikimedia.org/T90932#2246338 (10chasemp) [03:23:12] 06Operations, 10Phabricator, 13Patch-For-Review: have any task put into ops-access-requests automatically generate an ops-access-review task - https://phabricator.wikimedia.org/T87467#991959 (10chasemp) [03:23:16] 06Operations, 10Phabricator, 13Patch-For-Review: have any task put into ops-access-requests automatically generate an ops-access-review task - https://phabricator.wikimedia.org/T87467#2246340 (10chasemp) [03:23:33] 06Operations, 10Phabricator, 05Security, 05WMF-NDA: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#2246343 (10chasemp) [03:24:19] 06Operations, 06WMF-Legal, 10Wikimedia-General-or-Unknown, 07Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#2246349 (10chasemp) [03:24:21] 06Operations, 06WMF-Legal, 10Wikimedia-General-or-Unknown, 07Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#711581 (10chasemp) [03:24:32] akosiaris: I don't have right to see status of cxserver on scb? I had it on sca :) [03:25:01] (ie service status cxserver) [03:54:11] 06Operations, 10Ops-Access-Requests, 05Security: define in Puppet or remove user account - tnegrin - https://phabricator.wikimedia.org/T90932#2246353 (10Tnegrin) Sorry Chase -- I don't actually need this now since I don't manage the analytics team anymore. Probably best to remove the access. 
thanks, -Toby [04:10:31] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [04:13:22] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [04:17:11] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:17:52] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:23:43] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Puppet has 1 failures [04:23:51] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Puppet has 1 failures [04:24:32] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [04:25:21] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [04:32:21] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:33:01] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:44:04] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: Deleted files sometimes remain visible to non-privileged users if permanently linked - https://phabricator.wikimedia.org/T109331#1546384 (10Jay8g) I too can still see the file, and I get the same IP from ping. 
[04:47:23] AphrontConnectionQueryException [04:50:01] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [04:50:01] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [04:51:35] 06Operations, 05Security: Define in Puppet or remove rogue user accounts not currently defined in admin/data.yaml - https://phabricator.wikimedia.org/T90923#2246459 (10Dzahn) [04:51:37] 06Operations, 10Ops-Access-Requests, 05Security: define in Puppet or remove user account - tnegrin - https://phabricator.wikimedia.org/T90932#2246458 (10Dzahn) 05Resolved>03Open [04:52:55] 06Operations, 10Ops-Access-Requests, 05Security: define in Puppet or remove user account - tnegrin - https://phabricator.wikimedia.org/T90932#1071319 (10Dzahn) @Tnegrin thanks for the update, reopened the ticket, we'll take care of it [04:53:09] 06Operations, 10Ops-Access-Requests, 05Security: define in Puppet or remove user account - tnegrin - https://phabricator.wikimedia.org/T90932#2246464 (10Dzahn) p:05High>03Normal [04:55:26] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.49 seconds [04:59:04] 06Operations, 10Ops-Access-Requests, 05Security: define in Puppet or remove user account - tnegrin - https://phabricator.wikimedia.org/T90932#2246485 (10Dzahn) @20after4 @chasemp @Tnegrin I think this notification about an old thing only happened right now because the releng team is importing git commits fro... [05:00:00] ^ people are getting notificatins about imported old commits [05:00:07] and then reply as if they just happened [05:00:30] but it might be good in specific cases :p [05:01:00] re: the timestamp thing on import [05:03:29] 06Operations, 10Ops-Access-Requests, 05Security: define in Puppet or remove user account - tnegrin - https://phabricator.wikimedia.org/T90932#2246487 (10Dzahn) it says "Still Importing... This commit is still importing. 
Changes will be visible once the import finishes." on https://phabricator.wikimedia.org... [05:10:15] (03PS1) 10Dzahn: admin: remove access for tnegrin pt1 [puppet] - 10https://gerrit.wikimedia.org/r/285898 (https://phabricator.wikimedia.org/T90932) [05:10:17] (03PS1) 10Dzahn: admin: remove access for tnegrin pt2 [puppet] - 10https://gerrit.wikimedia.org/r/285899 (https://phabricator.wikimedia.org/T90932) [05:11:15] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review, 05Security: define in Puppet or remove user account - tnegrin - https://phabricator.wikimedia.org/T90932#2246494 (10Dzahn) @godog ^ could you take a look? reverse access request [05:23:20] ACKNOWLEDGEMENT - HTTPS on ununpentium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused daniel_zahn ongoing migration [05:27:56] !log krypton remove RT packages, remnants from testing [05:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:34:07] !log mw1146 - hhvm restart [05:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:34:29] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.168 second response time [05:36:10] RECOVERY - HHVM rendering on mw1146 is OK: HTTP OK: HTTP/1.1 200 OK - 66737 bytes in 0.204 second response time [05:37:44] !log lvs1012 - puppet fail, tries to upgrade tcpdump package and cannot be authenticated [05:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:44:04] 06Operations: lvs1012 - puppet fail, tcpdump package cannot be authenticated - https://phabricator.wikimedia.org/T133832#2246512 (10Dzahn) [05:44:23] 06Operations: lvs1012 - puppet fail, tcpdump package cannot be authenticated - https://phabricator.wikimedia.org/T133832#2246524 (10Dzahn) [05:45:23] ACKNOWLEDGEMENT - puppet last run on lvs1012 is CRITICAL: CRITICAL: Puppet has 3 failures daniel_zahn 
https://phabricator.wikimedia.org/T133832 [05:45:53] afk, good night [06:30:18] 06Operations, 10OCG-General, 06Services, 13Patch-For-Review: Implement flag to tell an OCG machine not to take new tasks from the redis task queue - https://phabricator.wikimedia.org/T120077#2246887 (10Joe) Step 1 is not a puppet commit anymore; I guess even the flag to put on the FS could be done outside... [06:31:09] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:19] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:38] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:39] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:49] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:49] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:18] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:39] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures [06:42:49] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [06:44:39] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5085707 keys - replication_delay is 0 [06:48:10] (03PS3) 10Muehlenhoff: udp2log: Move ferm rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/285375 [06:56:29] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:56:49] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:57:08] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:57:20] RECOVERY - puppet last run on cp3048 is OK: 
OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:38] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:57:39] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:59] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:59:18] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:05:12] (03CR) 10Muehlenhoff: [C: 032 V: 032] udp2log: Move ferm rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/285375 (owner: 10Muehlenhoff) [07:07:16] (03CR) 10WMDE-leszek: "Strange as it may seem I believe it is expected. The only Labs instance where Phragile is currently running uses "phragile" role (ie. not " [puppet] - 10https://gerrit.wikimedia.org/r/285333 (owner: 10Dzahn) [07:18:28] 06Operations, 10OCG-General, 05codfw-rollout: Document eqiad/codfw transition plan for OCG - https://phabricator.wikimedia.org/T133164#2246994 (10Joe) >>! In T133164#2243258, @Joe wrote: > I actually have a question for @cscott > > how does mediawiki learn which backend to contact? I don't see any reference... [07:18:53] 06Operations, 13Patch-For-Review: install font packages on all appservers, not just imagescalers (was: Install fonts-wqy-zenhei on all mediawiki app servers) - https://phabricator.wikimedia.org/T84777#931171 (10MoritzMuehlenhoff) According to https://www.mediawiki.org/wiki/Extension:EasyTimeline Erik Zachte is... 
[07:23:10] (03PS2) 10ArielGlenn: fix string comparison in dumpcirrussearch for old dump cleanups [puppet] - 10https://gerrit.wikimedia.org/r/285693 [07:24:37] (03CR) 10ArielGlenn: [C: 032] fix string comparison in dumpcirrussearch for old dump cleanups [puppet] - 10https://gerrit.wikimedia.org/r/285693 (owner: 10ArielGlenn) [07:25:50] 06Operations, 10OCG-General, 05codfw-rollout: Document eqiad/codfw transition plan for OCG - https://phabricator.wikimedia.org/T133164#2247003 (10mobrovac) >>! In T133164#2246994, @Joe wrote: > Answering myself: from https://github.com/wikimedia/mediawiki-extensions-Collection/blob/master/Collection.body.php... [07:25:57] 06Operations, 10Phabricator: Database errors using phabricator - https://phabricator.wikimedia.org/T133826#2246398 (10Paladox) >>! In T133826#2246628, @Peachey88 wrote: > @mmodell is currently importing Gerrit repositories into differential I believe. He actually included a change that allows us to import all... [07:42:09] PROBLEM - PHD should be supervising processes on iridium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (phd) [07:44:00] RECOVERY - PHD should be supervising processes on iridium is OK: PROCS OK: 15 processes with UID = 997 (phd) [07:46:18] (03PS1) 10Muehlenhoff: Enable base::firewall on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/285904 [07:49:37] 06Operations, 10Phabricator: Database errors using phabricator - https://phabricator.wikimedia.org/T133826#2247077 (10mmodell) I've reduced the number of phabricator task queue workers, this should eliminate the db errors at the expense of increasing the time to complete the import. [07:50:45] !log reduced the number of phabricator worker processes to hopefully stop exhausting mysql connections. 
[07:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:51:13] !log applied a hotfix to phabricator repository import job so that autoclose will not apply to unmerged refs/changes [07:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:09:31] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2247110 (10faidon) See: https://packages.debian.org/sid/apt-transport-https https://launchpad.net/ubuntu/trusty/+package/apt-transport-https Also check out... [08:12:46] !log restarting kafka on kafka{1012,1014,1022,1020,2001,2002} for Java upgrades. Will probably trigger some EventLogging alarms due to a bug (T133779) [08:12:46] T133779: Event Logging doesn't handle kafka nodes restart cleanly - https://phabricator.wikimedia.org/T133779 [08:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:13:02] good morning [08:13:06] o/ [08:14:53] (03CR) 10Alexandros Kosiaris: [C: 032] palladium: add v6 address [puppet] - 10https://gerrit.wikimedia.org/r/285665 (owner: 10Giuseppe Lavagetto) [08:14:59] (03PS2) 10Alexandros Kosiaris: palladium: add v6 address [puppet] - 10https://gerrit.wikimedia.org/r/285665 (owner: 10Giuseppe Lavagetto) [08:15:06] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] palladium: add v6 address [puppet] - 10https://gerrit.wikimedia.org/r/285665 (owner: 10Giuseppe Lavagetto) [08:17:46] 06Operations: lvs1012 - puppet fail, tcpdump package cannot be authenticated - https://phabricator.wikimedia.org/T133832#2246512 (10MoritzMuehlenhoff) That's indirectly caused by some kind of connection issue, "apt-get update" stalls (while it works fine on e.g. lvs1011). 
[08:26:55] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 07Elasticsearch: Improve Elasticsearch icinga altering - https://phabricator.wikimedia.org/T133844#2247220 (10Gehel) [08:35:19] twentyafterfour: around? [08:35:34] 06Operations, 10Wikimedia-SVG-rendering: Install Noto CJK (Source Han Sans) font family for SVG rendering - https://phabricator.wikimedia.org/T123223#2247320 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [08:35:43] Did phabricator have an upgrade? [08:37:06] addshore: yes [08:37:15] (03PS1) 10Jcrespo: Add grants to pdns mysql for localhost [puppet] - 10https://gerrit.wikimedia.org/r/285907 (https://phabricator.wikimedia.org/T128737) [08:37:17] okay! [08:37:40] any details about it? ie, to what version? [08:41:17] (03PS2) 10Giuseppe Lavagetto: tcpircbot: allow sending messages from palladium [puppet] - 10https://gerrit.wikimedia.org/r/285653 [08:42:07] (03PS3) 10Giuseppe Lavagetto: tcpircbot: allow sending messages from palladium [puppet] - 10https://gerrit.wikimedia.org/r/285653 [08:49:13] (03CR) 10Giuseppe Lavagetto: [C: 032] tcpircbot: allow sending messages from palladium [puppet] - 10https://gerrit.wikimedia.org/r/285653 (owner: 10Giuseppe Lavagetto) [08:49:41] 06Operations, 10netops: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#2247332 (10fgiunchedi) [08:51:34] 06Operations, 10netops: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#2247344 (10fgiunchedi) p:05Triage>03Normal [08:55:20] (03PS1) 10ArielGlenn: add new third party mirror for "other" datasets [puppet] - 10https://gerrit.wikimedia.org/r/285909 [08:56:44] (03CR) 10ArielGlenn: [C: 032] add new third party mirror for "other" datasets [puppet] - 10https://gerrit.wikimedia.org/r/285909 (owner: 10ArielGlenn) [08:57:48] !log oblivian@palladium conftool action : set/weight=12; selector: name=mw2018.codfw.wmnet [08:57:54] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:57:56] <_joe_> it works :) [08:58:23] !log oblivian@palladium conftool action : set/weight=10; selector: name=mw2018.codfw.wmnet [08:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:59:15] !log starting rolling restart of elasticsearch cluster in eqiad (T110236) [08:59:16] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [08:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:59:31] (03PS5) 10Volans: MariaDB: Load and enable semi-sync replication [puppet] - 10https://gerrit.wikimedia.org/r/285649 (https://phabricator.wikimedia.org/T133333) [09:00:37] (03PS2) 10Jcrespo: Add grants to pdns mysql for localhost [puppet] - 10https://gerrit.wikimedia.org/r/285907 (https://phabricator.wikimedia.org/T128737) [09:00:45] !log restarting elasticsearch server elastic1001.eqiad.wmnet (T110236) [09:00:46] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [09:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:01:10] (03CR) 10Filippo Giunchedi: "minor nit, LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [09:01:49] (03CR) 10Elukey: [C: 031] Enable base::firewall on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/285904 (owner: 10Muehlenhoff) [09:02:21] (03PS3) 10Jcrespo: Add grants to pdns mysql for localhost [puppet] - 10https://gerrit.wikimedia.org/r/285907 (https://phabricator.wikimedia.org/T128737) [09:03:49] !log remove obsolete mysql 5.5 installations from mw1022, mw1023, mw1024, mw1025, mw1114 and mw1163 [09:03:52] (03PS1) 10ArielGlenn: dataset rsync clients list in hiera requires a hostname so remove IP [puppet] - 10https://gerrit.wikimedia.org/r/285910 
[09:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:04:55] (03CR) 10ArielGlenn: [C: 032] dataset rsync clients list in hiera requires a hostname so remove IP [puppet] - 10https://gerrit.wikimedia.org/r/285910 (owner: 10ArielGlenn) [09:04:57] (03PS6) 10Volans: MariaDB: Load and enable semi-sync replication [puppet] - 10https://gerrit.wikimedia.org/r/285649 (https://phabricator.wikimedia.org/T133333) [09:06:44] (03PS3) 10Elukey: Add conf200[123] Zookeeper Service nodes in codfw. [puppet] - 10https://gerrit.wikimedia.org/r/285393 (https://phabricator.wikimedia.org/T131959) [09:07:01] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.123, port=7232): Max retries exceeded with url: /analytics.wikimedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [09:07:21] PROBLEM - AQS root url on aqs1001 is CRITICAL: Connection refused [09:07:48] Sorry ops team, AQS errors are me deploying [09:09:12] let's see if we can de-pool it [09:10:14] !log elukey@palladium conftool action : set/pooled=no; selector: aqs1001.eqiad.wmnet [09:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:10:27] (03PS4) 10Jcrespo: Add grants to pdns mysql for localhost [puppet] - 10https://gerrit.wikimedia.org/r/285907 (https://phabricator.wikimedia.org/T128737) [09:10:41] _joe_ WOW nice! 
[09:10:43] ---^ [09:10:59] (03PS7) 10Volans: MariaDB: Load and enable semi-sync replication [puppet] - 10https://gerrit.wikimedia.org/r/285649 (https://phabricator.wikimedia.org/T133333) [09:13:08] (03CR) 10Jcrespo: [C: 032] Add grants to pdns mysql for localhost [puppet] - 10https://gerrit.wikimedia.org/r/285907 (https://phabricator.wikimedia.org/T128737) (owner: 10Jcrespo) [09:16:32] (03PS1) 10Muehlenhoff: Add salt grain for RT and wire up in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/285913 [09:18:31] addshore: https://phabricator.wikimedia.org/T128009 [09:18:49] many thanks! [09:21:54] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2247450 (10fgiunchedi) @papaul those would be the ssds we installed and then removed in T127333 [09:23:01] 06Operations, 10Phabricator: Database errors using phabricator - https://phabricator.wikimedia.org/T133826#2247456 (10mmodell) 05Open>03Resolved a:03mmodell Seems to be stable now. 
[09:23:57] !log removing unused mysql-server-5.5 from holmium (keeping database just in case) T128737 [09:23:58] T128737: Move labs pdns database off of m5-master - https://phabricator.wikimedia.org/T128737 [09:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:24:36] 06Operations, 06Services, 13Patch-For-Review, 07RESTBase-architecture: Separate /var on restbase100x - https://phabricator.wikimedia.org/T113714#2247463 (10fgiunchedi) [09:24:38] 07Blocked-on-Operations, 06Operations, 10RESTBase, 10hardware-requests: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#2247461 (10fgiunchedi) 05Open>03Resolved that's correct @robh, resolving [09:26:38] (03PS2) 10ArielGlenn: fix up rsync of kiwix openzim files to dataset host [puppet] - 10https://gerrit.wikimedia.org/r/285689 [09:26:38] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 2 others: upgrade HHVM to 3.12.1 on terbium (Elastica: missing curl_init_pooled method.) - https://phabricator.wikimedia.org/T132751#2247468 (10Gehel) HHVM is upgraded to 3.12.1: ``` gehel@terbium:~$ hhvm --version HipHop VM 3.12.1... [09:28:10] 06Operations, 06Services, 13Patch-For-Review, 07RESTBase-architecture: Separate /var on restbase - https://phabricator.wikimedia.org/T113714#2247471 (10fgiunchedi) [09:28:47] 06Operations, 06Services, 13Patch-For-Review, 07RESTBase-architecture: Separate /var on restbase - https://phabricator.wikimedia.org/T113714#1674039 (10fgiunchedi) eqiad is done, codfw has restbase200[356] to be converted to multi-instance, which will resolve this too [09:34:54] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 05codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2247481 (10Gehel) Nothing to do here on the WDQS side. Closing it. 
[09:41:07] (03CR) 10Filippo Giunchedi: [C: 031] admin: remove access for tnegrin pt1 [puppet] - 10https://gerrit.wikimedia.org/r/285898 (https://phabricator.wikimedia.org/T90932) (owner: 10Dzahn) [09:41:19] (03CR) 10Filippo Giunchedi: [C: 031] admin: remove access for tnegrin pt2 [puppet] - 10https://gerrit.wikimedia.org/r/285899 (https://phabricator.wikimedia.org/T90932) (owner: 10Dzahn) [09:46:10] !log restarting elasticsearch server elastic1002.eqiad.wmnet (T110236) [09:46:10] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [09:46:13] 06Operations, 06Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Wikidata Query Service REST endpoint returns truncated results - https://phabricator.wikimedia.org/T133490#2247526 (10fgiunchedi) p:05Triage>03Normal [09:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:49:38] 06Operations, 10DBA, 05codfw-rollout, 03codfw-rollout-Jan-Mar-2016: Improve documentation about database switchover - https://phabricator.wikimedia.org/T129236#2247534 (10jcrespo) 05Open>03Resolved a:03jcrespo I think the documentation: https://wikitech.wikimedia.org/wiki/Switch_Datacenter is pretty... [09:51:28] 06Operations, 13Patch-For-Review, 05codfw-rollout, 03codfw-rollout-Jan-Mar-2016: Figure out and document the datacenter switchover process - https://phabricator.wikimedia.org/T124670#1962060 (10jcrespo) "we should keep up-to-date going forward" is not really a finite actionable task, I would consider this... 
[09:51:35] guy, want to close https://phabricator.wikimedia.org/T124670 [09:52:18] *you people [09:52:51] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [09:53:08] 06Operations, 13Patch-For-Review, 05codfw-rollout, 03codfw-rollout-Jan-Mar-2016: Figure out and document the datacenter switchover process - https://phabricator.wikimedia.org/T124670#2247546 (10jcrespo) Process documented on: https://phabricator.wikimedia.org/T124670 [09:53:10] RECOVERY - AQS root url on aqs1001 is OK: HTTP OK: HTTP/1.1 200 - 727 bytes in 0.008 second response time [09:53:15] <_joe_> jynus: go on :) [09:53:33] <_joe_> btw the new phab interface is, well, different [09:53:47] 06Operations, 13Patch-For-Review, 05codfw-rollout, 03codfw-rollout-Jan-Mar-2016: Figure out and document the datacenter switchover process - https://phabricator.wikimedia.org/T124670#2247547 (10jcrespo) 05Open>03Resolved a:03jcrespo [09:55:51] also T127974 (probably resolved), T126632 (remove tag), T122134 (remove tag) [09:55:51] T122134: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134 [09:55:52] T126632: Scap should restart job runners to pick up new config - https://phabricator.wikimedia.org/T126632 [09:55:52] T127974: Services DC switch-over checklist / tracking task - https://phabricator.wikimedia.org/T127974 [09:55:59] 06Operations, 10Traffic, 10Continuous-Integration-Infrastructure (phase-out-gallium): Move gallium to an internal host? - https://phabricator.wikimedia.org/T133150#2247551 (10fgiunchedi) 05Open>03stalled p:05Triage>03Normal [09:56:17] 06Operations, 10Traffic, 10Continuous-Integration-Infrastructure (phase-out-gallium): Move gallium to an internal host? - https://phabricator.wikimedia.org/T133150#2223557 (10fgiunchedi) [10:03:09] (03CR) 10Jcrespo: [C: 031] "Looks good, but let's test it just after deploy." 
[puppet] - 10https://gerrit.wikimedia.org/r/285649 (https://phabricator.wikimedia.org/T133333) (owner: 10Volans) [10:06:15] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 2 others: upgrade HHVM to 3.12.1 on terbium (Elastica: missing curl_init_pooled method.) - https://phabricator.wikimedia.org/T132751#2247597 (10hashar) terbium has a bunch of maintenance script runs by people or via a cron. Some re... [10:12:33] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 2 others: upgrade HHVM to 3.12.1 on terbium (Elastica: missing curl_init_pooled method.) - https://phabricator.wikimedia.org/T132751#2247645 (10hashar) From Logstash the messages are currently emitted for `uzwiki` and `uzwikitionary... [10:14:51] 07Blocked-on-Operations, 05Continuous-Integration-Scaling, 13Patch-For-Review, 07WorkType-NewFunctionality, 03releng-201516-q4: Attempt to provide a Trusty image for Nodepool - https://phabricator.wikimedia.org/T133203#2247667 (10MoritzMuehlenhoff) [10:18:20] mh tin/mira have unmerged changes for mediawiki_config, specifically looks like https://gerrit.wikimedia.org/r/#/c/285765/ never got deployed, matt_flaschen ? [10:19:35] (03PS3) 10Alexandros Kosiaris: Add yurik to deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/285706 (owner: 10Yurik) [10:19:41] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Add yurik to deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/285706 (owner: 10Yurik) [10:21:16] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 2 others: upgrade HHVM to 3.12.1 on terbium (Elastica: missing curl_init_pooled method.) - https://phabricator.wikimedia.org/T132751#2247677 (10hashar) mwscript got hardcoded to Zend via 8f8e7dbdd834066504e59edfc4881bb98f76072a / ht... 
[10:21:38] (03PS1) 10Filippo Giunchedi: monitoring: report reference name on uncommitted changes [puppet] - 10https://gerrit.wikimedia.org/r/285924 [10:24:27] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 3 others: ElasticSearch Not enough active copies to meet write consistency - https://phabricator.wikimedia.org/T133784#2247687 (10hashar) The log spam is gone: {F3941716 size=full} Thank you! [10:25:32] 06Operations, 10Traffic, 13Patch-For-Review: confctl: improve/upgrade --tags/--find - https://phabricator.wikimedia.org/T128199#2247688 (10Joe) 05Open>03Resolved [10:28:28] looks harmless to me to fetch/merge on tin for gerrit 285765 [10:29:20] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [10:30:32] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [10:31:07] !log installing PHP updates for jessie [10:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:32:21] !log Set new email for global user "Sebschlicht" per https://meta.wikimedia.org/w/index.php?oldid=15564713#Sebschlicht2.40global and private communication [10:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:38:03] (03CR) 10Mobrovac: "> hieradata/labs/host/deployment-restbase02.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [10:39:47] !log elukey@palladium conftool action : set/pooled=yes; selector: aqs1001.eqiad.wmnet [10:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:41:42] !log running update table on eventlogging database on the master (db1046) T108856 [10:41:42] T108856: Set up bucketization of editCount fields {tick} - https://phabricator.wikimedia.org/T108856 [10:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:43:51] 06Operations, 
10DBA, 13Patch-For-Review: reimage or decom db servers on precise - https://phabricator.wikimedia.org/T125028#2247719 (10jcrespo) [10:44:36] (03PS1) 10Yuvipanda: tools: Remove all webservice related code [puppet] - 10https://gerrit.wikimedia.org/r/285926 (https://phabricator.wikimedia.org/T98440) [10:46:17] 06Operations: lvs1012 - puppet fail, tcpdump package cannot be authenticated - https://phabricator.wikimedia.org/T133832#2247729 (10fgiunchedi) p:05Triage>03Normal [10:47:12] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 07Elasticsearch: Improve Elasticsearch icinga altering - https://phabricator.wikimedia.org/T133844#2247730 (10fgiunchedi) p:05Triage>03Normal [10:47:38] 06Operations, 10OCG-General, 05codfw-rollout: Document eqiad/codfw transition plan for OCG - https://phabricator.wikimedia.org/T133164#2247731 (10fgiunchedi) p:05Triage>03Normal [10:50:10] 06Operations, 10Wikimedia-Apache-configuration, 07Varnish: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629#2247749 (10fgiunchedi) p:05Triage>03Normal [10:50:27] !log stopping and restarting db1038 for backup and upgrade T125028 [10:50:27] T125028: reimage or decom db servers on precise - https://phabricator.wikimedia.org/T125028 [10:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:51:21] 06Operations, 10OCG-General, 05codfw-rollout: Use FQDNs instead of hostnames in the download urls sent to Mediawiki - https://phabricator.wikimedia.org/T133864#2247753 (10Joe) [10:51:59] ^wow [10:53:02] (03PS8) 10Volans: MariaDB: Load and enable semi-sync replication [puppet] - 10https://gerrit.wikimedia.org/r/285649 (https://phabricator.wikimedia.org/T133333) [10:54:58] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, puppet compiler on a sample of hosts https://puppet-compiler.wmflabs.org/2605/" [puppet] - 10https://gerrit.wikimedia.org/r/285367 
(https://phabricator.wikimedia.org/T126310) (owner: 10Giuseppe Lavagetto) [10:55:30] PROBLEM - statsv process on hafnium is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args statsv [11:01:20] RECOVERY - statsv process on hafnium is OK: PROCS OK: 13 processes with command name python, args statsv [11:03:28] !log backing up db1038 data to dbstore1002 [11:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:03:40] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1038.eqiad.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db1038.eqiad.wmnet (111 Connection refused) [11:04:30] ^normal, expected, dbstore1001's master is db1038 [11:04:35] acked [11:06:12] 06Operations, 10OCG-General, 05codfw-rollout: Document eqiad/codfw transition plan for OCG - https://phabricator.wikimedia.org/T133164#2247786 (10Joe) >>! In T133164#2247003, @mobrovac wrote: >>>! In T133164#2246994, @Joe wrote: >> Answering myself: from https://github.com/wikimedia/mediawiki-extensions-Coll... [11:08:00] (03PS1) 10Dereckson: GoogleNewsSitemap configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285927 (https://phabricator.wikimedia.org/T39608) [11:12:02] (03PS1) 10Jcrespo: Config changes for db1038 (old s3 master) reimaging [puppet] - 10https://gerrit.wikimedia.org/r/285928 (https://phabricator.wikimedia.org/T125028) [11:12:46] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 3 others: upgrade HHVM to 3.12.1 on terbium (Elastica: missing curl_init_pooled method.) - https://phabricator.wikimedia.org/T132751#2247813 (10Gehel) [11:12:49] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 3 others: upgrade HHVM to 3.12.1 on terbium (Elastica: missing curl_init_pooled method.) 
- https://phabricator.wikimedia.org/T132751#2209239 (10Gehel) [11:13:00] thanks Dereckson :D (for claiming the urlshortener ticket) [11:13:07] (03PS4) 10Elukey: Add conf200[123] Zookeeper Service nodes in codfw. [puppet] - 10https://gerrit.wikimedia.org/r/285393 (https://phabricator.wikimedia.org/T131959) [11:13:56] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reinstall and data reload of WDQS servers - https://phabricator.wikimedia.org/T133566#2247815 (10Gehel) [11:14:00] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reinstall and data reload of WDQS servers - https://phabricator.wikimedia.org/T133566#2236004 (10Gehel) [11:14:14] !log restarting elasticsearch server elastic1003.eqiad.wmnet (T110236) [11:14:15] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [11:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:14:21] !log upgraded varnish on cp1008 to 3.0.7 (except one patch) [11:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:14:28] You're welcome. [11:15:23] (03PS5) 10Elukey: Add conf200[123] Zookeeper Service nodes in codfw. [puppet] - 10https://gerrit.wikimedia.org/r/285393 (https://phabricator.wikimedia.org/T131959) [11:17:51] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 3 others: upgrade HHVM to 3.12.1 on terbium (Elastica: missing curl_init_pooled method.) - https://phabricator.wikimedia.org/T132751#2247837 (10Gehel) a:05Gehel>03None [11:18:21] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 3 others: upgrade HHVM to 3.12.1 on terbium (Elastica: missing curl_init_pooled method.) - https://phabricator.wikimedia.org/T132751#2209239 (10Gehel) Up for grab, this is probably something @EBernhardson or @dcausse can have a look... 
[11:18:42] (03PS2) 10Yuvipanda: tools: Remove all webservice related code [puppet] - 10https://gerrit.wikimedia.org/r/285926 (https://phabricator.wikimedia.org/T98440) [11:19:44] !log cleaning up some space on puppet-compiler host [11:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:20:15] volans: I was about to do the same, thanks :) [11:20:23] are you removing stuff from /mnt/jenkins-workspace/puppet-compiler/output ? [11:20:28] (03PS3) 10Yuvipanda: tools: Remove all webservice related code [puppet] - 10https://gerrit.wikimedia.org/r/285926 (https://phabricator.wikimedia.org/T98440) [11:21:12] elukey: yes, aborted jobs or mine [11:21:14] to be sure [11:21:15] (03PS4) 10Yuvipanda: tools: Remove all webservice related code [puppet] - 10https://gerrit.wikimedia.org/r/285926 (https://phabricator.wikimedia.org/T98440) [11:21:32] I'm getting the bigger ones, some space is already free, so you can run your job if needed [11:25:22] !log uploaded varnish 3.0.6plus-wm9 to carbon for jessie-wikimedia [11:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:26:38] 06Operations, 10Traffic, 10Wiki-Loves-Monuments-General, 07HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2247901 (10SindyM3) @JanZerebecki Via Let's Encrypt, it is not possible to have all the necessary certificates. I have contacted the Wikimedia DE. W... [11:28:44] Why isn't it possible? 
[11:29:40] (03PS5) 10Yuvipanda: tools: Remove all webservice related code [puppet] - 10https://gerrit.wikimedia.org/r/285926 (https://phabricator.wikimedia.org/T98440) [11:29:54] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Remove all webservice related code [puppet] - 10https://gerrit.wikimedia.org/r/285926 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [11:36:20] 06Operations, 06Labs, 10Labs-Infrastructure: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2247930 (10hashar) I wrote a stupid resolver for the A records: {P2969} Running it right now from deployment-tin and range `70000-85000`. [11:45:24] (03PS1) 10Dereckson: Apache Redirects for w.wiki [puppet] - 10https://gerrit.wikimedia.org/r/285932 (https://phabricator.wikimedia.org/T108557) [11:47:28] (03CR) 10Dereckson: "This change will edit modules/mediawiki/files/apache/sites/redirects.conf like this:" [puppet] - 10https://gerrit.wikimedia.org/r/285932 (https://phabricator.wikimedia.org/T108557) (owner: 10Dereckson) [11:49:29] (03CR) 10Volans: "@Jcrespo, here the compiler runs:" [puppet] - 10https://gerrit.wikimedia.org/r/285649 (https://phabricator.wikimedia.org/T133333) (owner: 10Volans) [11:50:46] (03PS2) 10Dereckson: Apache Redirects for w.wiki [puppet] - 10https://gerrit.wikimedia.org/r/285932 (https://phabricator.wikimedia.org/T133485) [11:50:52] 06Operations, 06Labs, 10Labs-Infrastructure: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2247941 (10hashar) Out of 15000 A entries, only one leaked: ``` $ python blam.py --delay 0.1 70000-85000 Start: ci-jessie-wikimedia-70000.contintcloud.eqiad.wmflabs... 
[11:50:54] (03PS3) 10ArielGlenn: fix up rsync of kiwix openzim files to dataset host [puppet] - 10https://gerrit.wikimedia.org/r/285689 [11:51:47] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: puppet fail [11:53:23] (03PS4) 10ArielGlenn: fix up rsync of kiwix openzim files to dataset host [puppet] - 10https://gerrit.wikimedia.org/r/285689 [11:54:26] (03CR) 10ArielGlenn: [C: 032] fix up rsync of kiwix openzim files to dataset host [puppet] - 10https://gerrit.wikimedia.org/r/285689 (owner: 10ArielGlenn) [11:59:32] (03PS1) 10ArielGlenn: add hostname for third party mirror of "other" files to rsync clients list [puppet] - 10https://gerrit.wikimedia.org/r/285933 [12:01:06] (03CR) 10ArielGlenn: [C: 032] add hostname for third party mirror of "other" files to rsync clients list [puppet] - 10https://gerrit.wikimedia.org/r/285933 (owner: 10ArielGlenn) [12:02:31] Seems that our new elasticsearch servers are racked (T133772). I'm having a look at installing them. You can expect a bunch of stupid questions... [12:02:32] T133772: Rack and Setup new elastic search - https://phabricator.wikimedia.org/T133772 [12:03:26] It seems those systems have multiple NICs, but only one should be used. How do I know which MAC to put in the DHCP configuration? [12:04:35] !log restarting elasticsearch server elastic1004.eqiad.wmnet (T110236) [12:04:35] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [12:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:16:14] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2248063 (10Chmarkine) I just tried on a new install of Ubuntu 12.04.5 Desktop, and apt-transport-https is installed out of box. ``` apt-transport-https: I... 
[12:19:27] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:21:09] (03CR) 10Giuseppe Lavagetto: "Some minor comments, LGTM otherwise" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813) (owner: 10Filippo Giunchedi) [12:21:42] (03PS1) 10Dereckson: Redirect m.wikipedia.org to portal [puppet] - 10https://gerrit.wikimedia.org/r/285936 (https://phabricator.wikimedia.org/T69015) [12:31:39] !log restarting sanitarium:s3 instance- query stuck again [12:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:31:53] !log Increase eqiad masters expire_logs_days (according to available space) T133333 [12:31:54] T133333: Audit new eqiad masters configuration - https://phabricator.wikimedia.org/T133333 [12:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:32:12] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:33:07] !log upgrade/rolling restart of mediawiki canaries for pcre upgrade [12:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:35:31] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [12:43:40] 06Operations, 10MediaWiki-extensions-ZeroBanner, 10Traffic, 10Wikimedia-Apache-configuration, and 3 others: m.wikipedia.org incorrectly redirects to en.m.wikipedia.org - https://phabricator.wikimedia.org/T69015#2248243 (10BBlack) [12:49:42] (03PS6) 10Elukey: Add conf200[123] Zookeeper Service nodes in codfw. 
[puppet] - 10https://gerrit.wikimedia.org/r/285393 (https://phabricator.wikimedia.org/T131959) [12:50:37] !log restarting elasticsearch server elastic1005.eqiad.wmnet (T110236) [12:50:38] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [12:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:52:58] 06Operations, 10Traffic: Varnish configuration for mobile domains should be coherent with Apache configuration - https://phabricator.wikimedia.org/T133895#2248306 (10Dereckson) [12:53:56] (03PS1) 10Rush: kubernetes upgrade to v1.2.3wmf1 [puppet] - 10https://gerrit.wikimedia.org/r/285939 [12:54:15] (03PS2) 10Rush: kubernetes upgrade to v1.2.3wmf1 [puppet] - 10https://gerrit.wikimedia.org/r/285939 [12:54:18] (03CR) 10jenkins-bot: [V: 04-1] kubernetes upgrade to v1.2.3wmf1 [puppet] - 10https://gerrit.wikimedia.org/r/285939 (owner: 10Rush) [12:56:18] (03CR) 10Yuvipanda: [C: 031] kubernetes upgrade to v1.2.3wmf1 [puppet] - 10https://gerrit.wikimedia.org/r/285939 (owner: 10Rush) [12:56:35] (03CR) 10Rush: [C: 032] kubernetes upgrade to v1.2.3wmf1 [puppet] - 10https://gerrit.wikimedia.org/r/285939 (owner: 10Rush) [12:58:58] (03PS7) 10Elukey: Add conf200[123] Zookeeper Service nodes in codfw. [puppet] - 10https://gerrit.wikimedia.org/r/285393 (https://phabricator.wikimedia.org/T131959) [13:00:41] 06Operations, 10Traffic: Varnish configuration for mobile domains should be coherent with Apache configuration - https://phabricator.wikimedia.org/T133895#2248387 (10Dereckson) >>! In T69015#2248241, @BBlack wrote: > Specifically the 3x set req.http.MobileHost regex lines (lines 34-36 currently), they seem to,... 
[13:07:01] PROBLEM - puppet last run on mw2171 is CRITICAL: CRITICAL: puppet fail [13:07:21] 06Operations, 06Labs, 10Labs-Infrastructure: Estimate hardware requirements for relevance lab elasticsearch servers - https://phabricator.wikimedia.org/T128433#2248394 (10Gehel) 05Open>03Resolved [13:07:55] 06Operations, 06Labs, 10Labs-Infrastructure: Estimate hardware requirements for relevance lab elasticsearch servers - https://phabricator.wikimedia.org/T128433#2074588 (10Gehel) Closing this as resolved, decision on hardware sizing has been taken on T131184, with input from this task. [13:10:25] (03PS8) 10Elukey: Add conf200[123] Zookeeper Service nodes in codfw. [puppet] - 10https://gerrit.wikimedia.org/r/285393 (https://phabricator.wikimedia.org/T131959) [13:10:44] (03CR) 10Volans: [C: 031] "1 non-blocking comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/285928 (https://phabricator.wikimedia.org/T125028) (owner: 10Jcrespo) [13:14:10] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 636 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5114386 keys - replication_delay is 636 [13:16:02] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5083026 keys - replication_delay is 0 [13:19:46] (03CR) 10Jcrespo: Config changes for db1038 (old s3 master) reimaging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/285928 (https://phabricator.wikimedia.org/T125028) (owner: 10Jcrespo) [13:20:23] 06Operations, 10Traffic: Varnish configuration for mobile domains should be coherent with Apache configuration - https://phabricator.wikimedia.org/T133895#2248432 (10Dereckson) Actually, the text-frontend.inc.vcl.erb code is mostly correct: `regsub(req.http.MobileHost, "^(www\.)?(mediawiki|wikimediafoundation... 
[13:20:57] 06Operations, 10Wikimedia-Site-requests, 07Wikimedia-log-errors: Requests to localhost spam the 'localhost' and 'xff' log buckets - https://phabricator.wikimedia.org/T129982#2248435 (10fgiunchedi) p:05Triage>03Normal [13:21:13] 06Operations, 10MediaWiki-General-or-Unknown, 06Release-Engineering-Team: Intermittent read-only errors on s3 wikis on March 14th - https://phabricator.wikimedia.org/T129947#2248437 (10fgiunchedi) p:05Triage>03Normal [13:22:29] (03Abandoned) 10Elukey: Allow basic apache maintenace webpages for the statistics::web role. [puppet] - 10https://gerrit.wikimedia.org/r/284878 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [13:22:47] (03PS3) 10Elukey: Example of possible configuration to run mc2009 with the latest memcached version. [puppet] - 10https://gerrit.wikimedia.org/r/284907 (https://phabricator.wikimedia.org/T129963) [13:23:56] !log rebooting cp1008 [13:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:25:27] (03PS1) 10Dereckson: Varnish: don't redirect www.wikimediafoundation.org to m.* [puppet] - 10https://gerrit.wikimedia.org/r/285944 [13:28:45] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 07Elasticsearch: Improve Elasticsearch icinga alerting - https://phabricator.wikimedia.org/T133844#2248466 (10Gehel) [13:28:54] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 07Elasticsearch: Improve Elasticsearch icinga alerting - https://phabricator.wikimedia.org/T133844#2247220 (10Gehel) [13:29:37] (03Abandoned) 10Hashar: hhvm: log dir creation requires rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/285526 (owner: 10Hashar) [13:29:46] (03CR) 10Mobrovac: [C: 04-1] "It seems a copy/paste fail sneaked in, otherwise LGTM." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/285393 (https://phabricator.wikimedia.org/T131959) (owner: 10Elukey) [13:30:39] y u no running phd? 
[13:32:57] PROBLEM - Host lvs1012 is DOWN: PING CRITICAL - Packet loss = 100% [13:33:18] RECOVERY - Host lvs1012 is UP: PING OK - Packet loss = 0%, RTA = 1.94 ms [13:33:24] 06Operations, 10MediaWiki-General-or-Unknown, 06Release-Engineering-Team: Intermittent read-only errors on s3 wikis on March 14th - https://phabricator.wikimedia.org/T129947#2248480 (10jcrespo) 05Open>03Resolved a:03jcrespo I would close this, AFAIK this didn't repeat, and after failover, state is comp... [13:33:35] (03PS9) 10Elukey: Add conf200[123] Zookeeper Service nodes in codfw. [puppet] - 10https://gerrit.wikimedia.org/r/285393 (https://phabricator.wikimedia.org/T131959) [13:35:37] RECOVERY - puppet last run on mw2171 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [13:36:50] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2248503 (10BBlack) Continuing on the saga above: building the ko for various kernels either doesn't work at all or requires different versions of the systemtap building tools than what... [13:38:01] phd start: Unable to start daemons because daemons are already running. [13:38:12] Then why are you telling me they're not running in the UI? 
[13:38:16] * ostriches finds something stabby [13:41:25] (03PS1) 10Hashar: hhvm: vary /var/log/hhvm owner based on distro [puppet] - 10https://gerrit.wikimedia.org/r/285945 [13:41:37] (03CR) 10Hashar: "follow up on https://gerrit.wikimedia.org/r/285945" [puppet] - 10https://gerrit.wikimedia.org/r/285526 (owner: 10Hashar) [13:44:14] (03PS1) 10Muehlenhoff: Add ferm rules for role::mariadb::tendril [puppet] - 10https://gerrit.wikimedia.org/r/285946 [13:44:16] (03PS1) 10Muehlenhoff: Enable base::firewall on db1011 [puppet] - 10https://gerrit.wikimedia.org/r/285947 [13:45:48] (03CR) 10Hashar: "Compiled for mw1090.eqiad.wmnet show it is a noop https://puppet-compiler.wmflabs.org/2617/" [puppet] - 10https://gerrit.wikimedia.org/r/285945 (owner: 10Hashar) [13:46:26] (03PS1) 10Yuvipanda: Fix typo for backcompat port env variable [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285948 [13:47:38] (03PS1) 10Dereckson: Restore garfieldairlines.net feed on fr.planet [puppet] - 10https://gerrit.wikimedia.org/r/285949 [13:48:58] (03PS2) 10Dereckson: Restore garfieldairlines.net feed on fr.planet [puppet] - 10https://gerrit.wikimedia.org/r/285949 [13:49:14] (03CR) 10Dereckson: "PS2: http → https" [puppet] - 10https://gerrit.wikimedia.org/r/285949 (owner: 10Dereckson) [13:49:53] ostriches: that's a known issue [13:50:18] Silly phab. [13:50:26] Making me worry before I even coffee'd [13:50:54] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2248553 (10BBlack) And... the output looks like that object was from a previous version of the source, the one in P2719 which lacks detection of an ALPN negotiation which specifies SPD... 
[13:51:28] (03CR) 10Yuvipanda: [C: 032] Fix typo for backcompat port env variable [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285948 (owner: 10Yuvipanda) [13:54:24] (03PS2) 10Hashar: hhvm: make /var/log/hhvm owned by root [puppet] - 10https://gerrit.wikimedia.org/r/285945 [13:56:11] (03CR) 10Hashar: "Made /var/log/hhvm to be owned by 'root'." [puppet] - 10https://gerrit.wikimedia.org/r/285945 (owner: 10Hashar) [13:58:53] 06Operations, 10Monitoring: improve redis master/slave monitoring - https://phabricator.wikimedia.org/T101584#2248579 (10fgiunchedi) redis replication checks were added in https://gerrit.wikimedia.org/r/#/c/282383/ by @Joe, any other redis-related checks we should be adding? otherwise this can be resolved [13:59:26] !log restarting elasticsearch server elastic1006.eqiad.wmnet (T110236) [13:59:27] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [13:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:59:52] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2248582 (10BBlack) Searched all the cache nodes for other builds that might have been left behind. Found a working compile of the latest source for 3.19.0-2 in @ema's homedir on cp104... [14:01:02] gehel: almost finished with the new elastic servers [14:01:39] cmjohnson1: thanks! [14:02:37] cmjohnson1: since you're there... how do I know which NIC is plugged to configure DHCP correctly? [14:02:47] 06Operations: Grafana: Job Queue Health: Panel is displayed incorrectly - https://phabricator.wikimedia.org/T130512#2248595 (10Luke081515) I get still the same: {F3942378} (Win7/Firefox 46) [14:02:58] it's always the first one [14:03:07] (03CR) 10Elukey: [C: 032] Add conf200[123] Zookeeper Service nodes in codfw. 
[puppet] - 10https://gerrit.wikimedia.org/r/285393 (https://phabricator.wikimedia.org/T131959) (owner: 10Elukey) [14:03:09] eth0 or in come cases eth1 [14:03:16] is that what you mean? [14:03:58] cmjohnson1: never overlook the obvious... [14:04:53] heh..sometimes.....it doesn't hurt simplest reasons are often the culprit of complicated problems [14:05:44] cmjohnson1: yes, probably. I need the MAC to configure DHCP, I managed to get the macs for all NICs from ILO, but I was wondering which of the 4 NIC is actually in use... [14:07:16] 06Operations, 10Traffic: Move login.wikimedia.org to its own IP address - https://phabricator.wikimedia.org/T82877#2248598 (10fgiunchedi) [14:07:33] cmjohnson1: just to make sure, the first NIC reported by ILO (show /system1/network1/Integrated_NICs) should be the one in use [14:07:34] gehel: I always plug the cable into the first port. we never use the other ports unless we're bonding. [14:07:40] yes [14:08:08] cmjohnson1: Thanks! I'll take a note for next time... [14:08:51] gehel: typically I update the dhcpd file to reflect the mac addresses as part of the setup process [14:09:22] I usually hand them off with the base installation completed and accessible via ssh [14:09:42] cmjohnson1: Thanks! I was planning to do it. But even better this way! [14:10:10] 06Operations, 10Traffic: Move login.wikimedia.org to its own IP address - https://phabricator.wikimedia.org/T82877#2248605 (10BBlack) 05Open>03declined Everything related to what this ticket is about has changed since it was written, and there was never any movement on it anyways. If someone still thinks... 
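[editor's note] The NIC exchange above boils down to one rule: the cable always goes into the first port, so the first NIC reported by ILO is the one whose MAC belongs in the dhcpd file. A minimal Python sketch of that rule follows. The function name, the example MACs and the exact stanza layout are assumptions made for illustration (the stanza uses the generic ISC dhcpd host-declaration form); this is not the actual contents of the puppet repo's dhcpd file.

```python
def dhcpd_host_entry(hostname, macs_in_ilo_order, domain="eqiad.wmnet"):
    """Build a generic ISC dhcpd host stanza for a server.

    Assumption (from the chat): the first NIC listed by ILO is the
    cabled one, so only its MAC is used; the others are ignored.
    """
    if not macs_in_ilo_order:
        raise ValueError("no MAC addresses supplied")
    primary_mac = macs_in_ilo_order[0].lower()
    return (
        f"host {hostname} {{\n"
        f"    hardware ethernet {primary_mac};\n"
        f"    fixed-address {hostname}.{domain};\n"
        f"}}\n"
    )


if __name__ == "__main__":
    # Hypothetical placeholder MACs, listed in the order ILO would report them.
    example_macs = [
        "00:00:5E:00:53:01",
        "00:00:5E:00:53:02",
        "00:00:5E:00:53:03",
        "00:00:5E:00:53:04",
    ]
    print(dhcpd_host_entry("elastic1032", example_macs))
```

If a host were bonded or cabled on another port (the "eth1 in some cases" situation), the first-entry assumption would not hold and the MAC would have to be picked by hand.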
[14:14:32] PROBLEM - RAID on lvs1012 is CRITICAL: Timeout while attempting connection [14:14:53] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 681 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5088629 keys - replication_delay is 681 [14:16:23] RECOVERY - RAID on lvs1012 is OK: OK: no RAID installed [14:16:53] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5082755 keys - replication_delay is 0 [14:19:22] 06Operations, 10Monitoring: Add RAID monitoring for HP servers - https://phabricator.wikimedia.org/T97998#2248618 (10faidon) [14:20:41] (03CR) 10Paladox: [C: 031] hhvm: make /var/log/hhvm owned by root [puppet] - 10https://gerrit.wikimedia.org/r/285945 (owner: 10Hashar) [14:25:57] !log deployed new zookeeper nodes in codfw (conf200[123]) [14:25:59] !log started SPDY stats sample on 8x caches - T96848#2248582 [14:26:02] T96848: Support HTTP/2 - https://phabricator.wikimedia.org/T96848 [14:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:27:06] (03PS1) 10Cmjohnson: Adding production dns for new elastic search serves elastic1032-1047 [dns] - 10https://gerrit.wikimedia.org/r/285956 [14:27:42] (03PS2) 10Giuseppe Lavagetto: mediawiki::web: drop HHVM define/Zend conditionals in all vhosts [puppet] - 10https://gerrit.wikimedia.org/r/285367 (https://phabricator.wikimedia.org/T126310) [14:28:09] (03CR) 10Cmjohnson: [C: 032] Adding production dns for new elastic search serves elastic1032-1047 [dns] - 10https://gerrit.wikimedia.org/r/285956 (owner: 10Cmjohnson) [14:28:23] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2248623 (10elukey) ``` elukey@conf2001:~$ /usr/share/zookeeper/bin/zkServer.sh status JMX enabled by default Using config: 
/etc/zookeeper/conf/zoo.cfg Mode: follower elukey@conf2002:~$ /... [14:28:38] (03CR) 10Gehel: "Puppet compiler output: https://puppet-compiler.wmflabs.org/2616/" [puppet] - 10https://gerrit.wikimedia.org/r/268215 (owner: 10EBernhardson) [14:31:38] (03PS1) 10Hashar: nodepool: bump # of instances [puppet] - 10https://gerrit.wikimedia.org/r/285957 (https://phabricator.wikimedia.org/T133911) [14:32:00] (03CR) 10BBlack: [C: 031] "All the diffs appear to fall into one of two categories:" [puppet] - 10https://gerrit.wikimedia.org/r/285367 (https://phabricator.wikimedia.org/T126310) (owner: 10Giuseppe Lavagetto) [14:32:05] (03CR) 10Hashar: [C: 04-1] "Pending T133911" [puppet] - 10https://gerrit.wikimedia.org/r/285957 (https://phabricator.wikimedia.org/T133911) (owner: 10Hashar) [14:32:14] !log wdqs-updater started on wdqs1002 (T133566) [14:32:15] T133566: Reinstall and data reload of WDQS servers - https://phabricator.wikimedia.org/T133566 [14:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:33:32] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 713 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5083171 keys - replication_delay is 713 [14:35:32] (03PS1) 10Elukey: Enable kafka200[12] to host Kafka and Event Bus [puppet] - 10https://gerrit.wikimedia.org/r/285958 (https://phabricator.wikimedia.org/T121558) [14:35:49] (03PS1) 10ArielGlenn: drop the user check on the kiwix rsync pgrep, not needed [puppet] - 10https://gerrit.wikimedia.org/r/285959 [14:36:43] (03CR) 10Giuseppe Lavagetto: [C: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/285367 (https://phabricator.wikimedia.org/T126310) (owner: 10Giuseppe Lavagetto) [14:36:59] (03CR) 10ArielGlenn: [C: 032] drop the user check on the kiwix rsync pgrep, not needed [puppet] - 10https://gerrit.wikimedia.org/r/285959 (owner: 10ArielGlenn) [14:37:12] (03PS2) 10ArielGlenn: drop the user check on the kiwix rsync pgrep, not needed 
[puppet] - 10https://gerrit.wikimedia.org/r/285959 [14:39:15] (03PS1) 10Filippo Giunchedi: releases: use secret() for gpg keyring [puppet] - 10https://gerrit.wikimedia.org/r/285961 [14:39:21] (03PS2) 10Elukey: Enable kafka200[12] to host Kafka and EventBus. [puppet] - 10https://gerrit.wikimedia.org/r/285958 (https://phabricator.wikimedia.org/T121558) [14:39:23] _joe_: about to merge your stuff on palladium [14:39:47] <_joe_> apergos: wait please [14:39:59] 1 second too late :-( [14:40:11] _joe_: [14:40:14] <_joe_> apergos: well I shouted out it was a big change [14:40:19] <_joe_> luckily I was ready [14:40:35] 06Operations, 05codfw-rollout, 03codfw-rollout-Jan-Mar-2016: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134#2248679 (10faidon) I still see a "url-downloader.wikimedia.org" reference in mediawiki-config, this should probably be amended to reference the per-site eq... [14:40:57] (03PS1) 10Muehlenhoff: Puppetise yubikey-val (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/285962 [14:41:13] sorry about that [14:42:09] (03PS1) 10Cmjohnson: Adding mac addresses to dhcpd file for elatic1032-1047 [puppet] - 10https://gerrit.wikimedia.org/r/285963 [14:42:27] (03CR) 10Hashar: [C: 031] "As a workaround, I went with creating a 'syslog' user on Jessie" [puppet] - 10https://gerrit.wikimedia.org/r/285945 (owner: 10Hashar) [14:42:31] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 03Scap3 (Scap3-MediaWiki-MVP): Completely port l10nupdate to scap - https://phabricator.wikimedia.org/T133913#2248683 (10mmodell) [14:43:52] (03PS2) 10Cmjohnson: Adding mac addresses to dhcpd file for elatic1032-1047 [puppet] - 10https://gerrit.wikimedia.org/r/285963 [14:46:10] (03CR) 10Elukey: "Puppet compiler looks ok: https://puppet-compiler.wmflabs.org/2622/" [puppet] - 10https://gerrit.wikimedia.org/r/285958 (https://phabricator.wikimedia.org/T121558) (owner: 10Elukey) [14:46:21] (03PS3) 10Cmjohnson: Adding mac addresses to dhcpd 
file for elatic1032-1047 [puppet] - 10https://gerrit.wikimedia.org/r/285963 [14:46:34] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2248701 (10Papaul) restbase2007 is the server with the Intel SSDs and it is racked in B1 and restbaset2009 is rack in D1. I will move restbase2007 in D1 and rename it restbase2009 and move restbase2009 whi... [14:47:23] (03CR) 10Cmjohnson: [C: 032] Adding mac addresses to dhcpd file for elatic1032-1047 [puppet] - 10https://gerrit.wikimedia.org/r/285963 (owner: 10Cmjohnson) [14:49:27] (03PS2) 10Filippo Giunchedi: releases: use secret() for gpg keyring [puppet] - 10https://gerrit.wikimedia.org/r/285961 [14:49:59] puppet disabled on 400 hosts right now [14:50:23] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5082543 keys - replication_delay is 0 [14:52:43] (03PS4) 10Muehlenhoff: jsbench: Add ferm rules for xvfb [puppet] - 10https://gerrit.wikimedia.org/r/282318 [14:54:30] (03PS1) 10Cmjohnson: Adding the new elastic search servers to netboot.cfg elastic-raid1 cfg [puppet] - 10https://gerrit.wikimedia.org/r/285965 [14:54:35] I'll jump in to take SWAT this morning if there are no objections. 
[14:54:42] (03PS9) 10Volans: MariaDB: Load and enable semi-sync replication [puppet] - 10https://gerrit.wikimedia.org/r/285649 (https://phabricator.wikimedia.org/T133333) [14:55:56] (03CR) 10Volans: [C: 032] MariaDB: Load and enable semi-sync replication [puppet] - 10https://gerrit.wikimedia.org/r/285649 (https://phabricator.wikimedia.org/T133333) (owner: 10Volans) [14:56:59] (03PS2) 10Jcrespo: Config changes for db1038 (old s3 master) reimaging [puppet] - 10https://gerrit.wikimedia.org/r/285928 (https://phabricator.wikimedia.org/T125028) [14:57:30] (03PS2) 10Cmjohnson: Adding the new elastic search servers to netboot.cfg elastic-raid1 cfg [puppet] - 10https://gerrit.wikimedia.org/r/285965 [14:57:43] (03CR) 10Jcrespo: [C: 032 V: 032] Config changes for db1038 (old s3 master) reimaging [puppet] - 10https://gerrit.wikimedia.org/r/285928 (https://phabricator.wikimedia.org/T125028) (owner: 10Jcrespo) [14:58:49] (03PS3) 10Cmjohnson: Adding the new elastic search servers to netboot.cfg elastic-raid1 cfg [puppet] - 10https://gerrit.wikimedia.org/r/285965 [15:00:04] anomie ostriches thcipriani marktraceur aude: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160428T1500). [15:00:05] mobrovac: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:39] (03CR) 10Jcrespo: [C: 031] "This is ready to go, let's schedule it at some time where we are both connected and available." [puppet] - 10https://gerrit.wikimedia.org/r/285947 (owner: 10Muehlenhoff) [15:00:56] I can SWAT today. mobrovac ping me when you're around. 
[15:01:26] (03CR) 10Cmjohnson: [C: 032] Adding the new elastic search servers to netboot.cfg elastic-raid1 cfg [puppet] - 10https://gerrit.wikimedia.org/r/285965 (owner: 10Cmjohnson) [15:01:29] (03CR) 10Jcrespo: [C: 031] "Idem than gerrit:285947" [puppet] - 10https://gerrit.wikimedia.org/r/285946 (owner: 10Muehlenhoff) [15:03:08] (03PS1) 10Gehel: Do not map new elasticsearch servers in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/285967 [15:05:27] RECOVERY - puppet last run on lvs1012 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [15:05:57] !log restarting db1038 for reimage to jessie [15:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:06:31] (03CR) 10Gehel: "Puppet compiler: https://puppet-compiler.wmflabs.org/2626/" [puppet] - 10https://gerrit.wikimedia.org/r/285967 (owner: 10Gehel) [15:06:55] 06Operations, 10ops-eqiad, 10Traffic: lvs1012 eth1 NIC Link flapping - https://phabricator.wikimedia.org/T133915#2248747 (10faidon) [15:07:18] gehel: what's with the puppet disabled across the elasticNNNN fleet? [15:07:38] I would suppose the downtime for that just expired, and it is old [15:08:07] paravoid: my bad, I should have reenabled it after investigation of cluster issues yesterday. [15:08:22] paravoid: I'll re-enable right now... [15:08:23] _joe_ > when we update modules/mediawiki/files/apache/sites/redirects/redirects.dat, we need to run refreshDomainRedirects ourselves or is that done automatically as a part of the Puppet compilation process? 
[15:08:25] ok [15:08:35] _joe_: you ran it yourself [15:08:35] <_joe_> Dereckson: the former [15:08:39] k [15:08:39] and commit both [15:08:59] or you review https://gerrit.wikimedia.org/r/#/c/138292/ and nag people to merge it :P [15:09:07] PROBLEM - puppet last run on elastic2006 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:09:08] PROBLEM - puppet last run on elastic2016 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:09:08] PROBLEM - puppet last run on elastic2007 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:09:08] PROBLEM - puppet last run on elastic2009 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:09:17] PROBLEM - puppet last run on elastic2022 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:09:29] PROBLEM - puppet last run on elastic2008 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:09:38] PROBLEM - puppet last run on elastic2018 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:09:47] PROBLEM - puppet last run on elastic2005 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:09:47] PROBLEM - puppet last run on elastic2012 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:09:47] PROBLEM - puppet last run on elastic2017 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:09:48] PROBLEM - puppet last run on elastic2013 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:09:50] paravoid: I get the icinga alerts when re-enabling puppet ? [15:09:57] PROBLEM - puppet last run on elastic2011 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:09:58] PROBLEM - puppet last run on elastic2002 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:10:04] <_joe_> gehel: ain't that fantastic? 
[15:10:08] PROBLEM - puppet last run on elastic2024 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:10:09] PROBLEM - puppet last run on elastic2004 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:10:09] PROBLEM - puppet last run on elastic2010 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:10:16] :-D [15:10:17] PROBLEM - puppet last run on elastic2020 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:10:18] PROBLEM - puppet last run on elastic2001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:10:18] a bit counter intuitive... [15:10:18] PROBLEM - puppet last run on elastic2014 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:10:34] sorry for the spam, I did not expect that :P [15:10:38] PROBLEM - puppet last run on elastic2003 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:10:47] PROBLEM - puppet last run on elastic2015 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:10:48] PROBLEM - puppet last run on elastic2021 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:10:48] PROBLEM - puppet last run on elastic2019 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:10:48] PROBLEM - puppet last run on elastic2023 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:11:03] if puppet is disabled and hasn't ran in a while, it's WARNING [15:11:14] (03PS3) 10Dereckson: Apache Redirects for w.wiki [puppet] - 10https://gerrit.wikimedia.org/r/285932 (https://phabricator.wikimedia.org/T108557) [15:11:16] if puppet is not disabled and hasn't ran in a while, it's a CRITICAL, hence the IRC ping [15:11:41] paravoid: it actually make sense... ! 
[15:12:18] RECOVERY - puppet last run on elastic2001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [15:12:24] !log restarting elasticsearch server elastic1007.eqiad.wmnet (T110236) [15:12:25] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [15:12:25] yeah but the net result behavior in IRC often doesn't make sense [15:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:45] sorry, but there is also going to be some spam during recovery... [15:12:48] RECOVERY - puppet last run on elastic2019 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:13:08] RECOVERY - puppet last run on elastic2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:13:08] RECOVERY - puppet last run on elastic2022 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:13:12] when you disable puppet for a while, there's no CRITICAL echo'd to IRC when it's disable or while it's disabled. But when you later re-enable puppet and run it successfully, there's a CRIT alert about 'last ran 1 day ago' that's about to clear it self. [15:13:19] RECOVERY - puppet last run on elastic2008 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:13:37] RECOVERY - puppet last run on elastic2012 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:13:48] RECOVERY - puppet last run on elastic2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:13:50] doing the thing that causes a less-than-ideal state says nothing. 
doing the thing that fixes that less-than-ideal state spams immediate CRIT->RECOVERs [15:13:59] RECOVERY - puppet last run on elastic2010 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:14:08] RECOVERY - puppet last run on elastic2020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:08] RECOVERY - puppet last run on elastic2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:28] RECOVERY - puppet last run on elastic2003 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [15:14:38] RECOVERY - puppet last run on elastic2015 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:14:39] RECOVERY - puppet last run on elastic2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:47] RECOVERY - puppet last run on elastic2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:57] RECOVERY - puppet last run on elastic2006 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [15:14:58] RECOVERY - puppet last run on elastic2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:58] RECOVERY - puppet last run on elastic2007 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [15:15:28] RECOVERY - puppet last run on elastic2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:28] 07Blocked-on-Operations, 10OOjs-UI, 05Continuous-Integration-Scaling, 13Patch-For-Review, 07WorkType-NewFunctionality: Provide composer on the nodepool servers so OOjs UI can use it in the npm job - https://phabricator.wikimedia.org/T128092#2248787 (10hashar) Did progress on provisioning PHP: | Distro |... 
[15:15:37] RECOVERY - puppet last run on elastic2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:37] RECOVERY - puppet last run on elastic2017 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:15:37] RECOVERY - puppet last run on elastic2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:48] RECOVERY - puppet last run on elastic2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:52] 07Blocked-on-Operations, 10OOjs-UI, 05Continuous-Integration-Scaling, 13Patch-For-Review, 07WorkType-NewFunctionality: Provide composer on the nodepool servers so OOjs UI can use it in the npm job - https://phabricator.wikimedia.org/T128092#2248788 (10hashar) So in theory if we set `PHP_BIN=hhvm` for the... [15:15:55] probably the Right Thing is to split this into two separate alerts: whether the last puppet run was a success, and whether puppet's been disabled for > X time. [15:15:57] RECOVERY - puppet last run on elastic2024 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:16:07] RECOVERY - puppet last run on elastic2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:16:39] when there's no real puppetfails and say X is 4 hours and you disable it and then re-enable + success-run 12 hours later... there would be CRITs at the 4h mark and recoveries as the 12h mark, and it would make sense [15:17:24] 06Operations, 10OCG-General, 06Services, 13Patch-For-Review: Implement flag to tell an OCG machine not to take new tasks from the redis task queue - https://phabricator.wikimedia.org/T120077#2248789 (10cscott) And the cleanup script can be integrated (once we test & validate it) so that a machine which fin... 
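[editor's note] The alerting discussion above can be sketched in a few lines of Python. The first function models the single check as it is described in the channel (a stale run is WARNING while puppet is disabled, CRITICAL once it is enabled again); the other two model the proposed split into a "last run succeeded" check and a "disabled for longer than X" check. All function names, parameters and thresholds are hypothetical, and this is not the real Icinga plugin code, only an illustration of the logic as discussed.

```python
from enum import Enum


class State(Enum):
    OK = "OK"
    WARNING = "WARNING"
    CRITICAL = "CRITICAL"


def combined_check(enabled, last_run_failed, hours_since_last_run, stale_after_hours):
    """Single check, as described in the channel (assumed logic)."""
    if last_run_failed:
        return State.CRITICAL
    if hours_since_last_run > stale_after_hours:
        # Disabled + stale only warns; enabled + stale is critical.
        return State.CRITICAL if enabled else State.WARNING
    return State.OK


def last_run_check(last_run_failed):
    """Proposed alert 1: was the most recent puppet run a success?"""
    return State.CRITICAL if last_run_failed else State.OK


def disabled_check(enabled, hours_disabled, max_disabled_hours):
    """Proposed alert 2: has puppet been disabled for more than X hours?"""
    if not enabled and hours_disabled > max_disabled_hours:
        return State.CRITICAL
    return State.OK


if __name__ == "__main__":
    x_hours = 4  # the "say X is 4 hours" example from the chat

    # Hour 5: puppet is disabled, the last run was fine but is getting old.
    print("combined @5h :", combined_check(False, False, 5, x_hours).value)
    print("split    @5h :", last_run_check(False).value,
          disabled_check(False, 5, x_hours).value)

    # Hour 12: puppet was just re-enabled, the first new run has not finished.
    print("combined @12h:", combined_check(True, False, 12, x_hours).value)
    print("split    @12h:", last_run_check(False).value,
          disabled_check(True, 0, x_hours).value)
```

Under these assumptions the combined check stays below CRITICAL for the whole time puppet is disabled and only goes CRITICAL at the moment of re-enabling, which matches the burst of PROBLEM/RECOVERY lines in the log. The split version goes CRITICAL once the disabled period exceeds X and recovers at re-enable time.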
[15:21:06] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 03Scap3 (Scap3-MediaWiki-MVP): Completely port l10nupdate to scap - https://phabricator.wikimedia.org/T133913#2248812 (10Reedy) https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/files/l10nupdate-1 Will still need to... [15:21:59] (03CR) 10Eevans: [WIP]: Cassandra 2.2.5 config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [15:22:25] thcipriani: ping [15:22:48] mobrovac: howdy [15:23:28] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283269 (https://phabricator.wikimedia.org/T132096) (owner: 10Mobrovac) [15:23:31] !log puppet disabled on mc2009 as preparation step for https://gerrit.wikimedia.org/r/#/c/284907 [15:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:48] 07Blocked-on-Operations, 10OOjs-UI, 05Continuous-Integration-Scaling, 13Patch-For-Review, 07WorkType-NewFunctionality: Provide composer on the nodepool servers so OOjs UI can use it in the npm job - https://phabricator.wikimedia.org/T128092#2248816 (10hashar) And composer is available. Will follow up on... [15:23:55] 07Blocked-on-Operations, 10OOjs-UI, 05Continuous-Integration-Scaling, 13Patch-For-Review, 07WorkType-NewFunctionality: Provide composer on the nodepool servers so OOjs UI can use it in the npm job - https://phabricator.wikimedia.org/T128092#2248818 (10hashar) 05Open>03Resolved [15:24:05] (03PS4) 10Elukey: Example of possible configuration to run mc2009 with the latest memcached version. 
[puppet] - 10https://gerrit.wikimedia.org/r/284907 (https://phabricator.wikimedia.org/T129963) [15:24:19] (03PS1) 10Dereckson: Clarify header Documentation for Apache redirects [puppet] - 10https://gerrit.wikimedia.org/r/285973 [15:24:42] (03PS2) 10Thcipriani: Math: increase the number of concurrent connections to 150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283269 (https://phabricator.wikimedia.org/T132096) (owner: 10Mobrovac) [15:24:55] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283269 (https://phabricator.wikimedia.org/T132096) (owner: 10Mobrovac) [15:24:57] (03PS2) 10Dereckson: Clarify header documentation for Apache redirects [puppet] - 10https://gerrit.wikimedia.org/r/285973 [15:25:10] I forgot mw-config is now a ff-only repo [15:25:39] (03Merged) 10jenkins-bot: Math: increase the number of concurrent connections to 150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283269 (https://phabricator.wikimedia.org/T132096) (owner: 10Mobrovac) [15:26:09] mobrovac: any special setup needed on your end? Good to sync? [15:26:23] thcipriani: nope, good to go for sync [15:27:11] (03CR) 10Giuseppe Lavagetto: [C: 031] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/285973 (owner: 10Dereckson) [15:27:25] !log gehel@palladium conftool action : get/pooled; selector: elastic1001.eqiad.wmnet [15:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:46] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: Math: increase the number of concurrent connections to 150 [[gerrit:283269]] (duration: 00m 35s) [15:27:51] Ow, reading conf also gets logged? [15:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:53] ^ mobrovac check please [15:28:32] (03CR) 10Elukey: [C: 032] Example of possible configuration to run mc2009 with the latest memcached version. 
[puppet] - 10https://gerrit.wikimedia.org/r/284907 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [15:28:47] thcipriani: looking good! [15:28:49] thcipriani: thnx! [15:28:54] mobrovac: thanks for checking! [15:29:33] (03PS4) 10Dereckson: Apache redirects for w.wiki [puppet] - 10https://gerrit.wikimedia.org/r/285932 (https://phabricator.wikimedia.org/T108557) [15:31:32] (03PS2) 10Gehel: Do not map new elasticsearch servers in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/285967 [15:32:35] PROBLEM - Memcached on mc2009 is CRITICAL: Connection refused [15:33:30] (03CR) 10Gehel: [C: 032] Do not map new elasticsearch servers in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/285967 (owner: 10Gehel) [15:33:48] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2248862 (10Volans) = Update on current Status === MySQL Replica - Thanks to the switchover to codfw we were able to complete the coredb masters failover/restart on eqiad hence now the circu... [15:35:06] !log installed memcached 1.4.25-2 (Debian sid/testing) in mc2009 as part of performance test (T129963) [15:35:07] T129963: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963 [15:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:36:05] RECOVERY - Memcached on mc2009 is OK: TCP OK - 0.037 second response time on port 11211 [15:37:40] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2248870 (10elukey) ``` elukey@mc2009:~$ ps aux | grep memcached nobody 12327 0.0 0.0 327708 2676 ? Ssl 15:33 0:00 /usr/bin/memcached -p 11211 -u... 
[15:37:54] 06Operations, 10OCG-General, 05codfw-rollout: Document eqiad/codfw transition plan for OCG - https://phabricator.wikimedia.org/T133164#2248871 (10cscott) The code is just using `os.hostname` in node, so FQDN is probably a VM configuration issue rather than an OCG code issue. If I understand the lock suggest... [15:39:21] 07Blocked-on-Operations, 10OOjs-UI, 05Continuous-Integration-Scaling, 13Patch-For-Review, 07WorkType-NewFunctionality: Provide composer on the nodepool servers so OOjs UI can use it in the npm job - https://phabricator.wikimedia.org/T128092#2248872 (10Jdforrester-WMF) \o/ [15:44:37] 06Operations, 10OCG-General, 05codfw-rollout: Document eqiad/codfw transition plan for OCG - https://phabricator.wikimedia.org/T133164#2248877 (10Joe) @cscott let's focus on the simpler issues instead than on changing the storage system. As for getting the FQDN, suggestions I see are like http://unix.stacke... [15:47:35] 06Operations, 10Traffic, 10Wiki-Loves-Monuments-General, 07HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2248879 (10JanZerebecki) A CSR will only provide one certificate. > Via Let's Encrypt, it is not possible to have all the necessary certificates. Ca... [15:52:03] (03CR) 10Gehel: ""puppet resource file /etc/logrotate.d/cirrus-suggest" on terbium show the file as managed." [puppet] - 10https://gerrit.wikimedia.org/r/268215 (owner: 10EBernhardson) [15:54:02] (03PS1) 10Elukey: Return a custom HTTP 503 response for all the stat1001 websites due to maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/285976 (https://phabricator.wikimedia.org/T76348) [15:56:39] 06Operations, 10OCG-General, 05codfw-rollout: Use FQDNs instead of hostnames in the download urls sent to Mediawiki - https://phabricator.wikimedia.org/T133864#2248916 (10cscott) The code is just using [`os.hostname`](https://nodejs.org/api/os.html#os_os_hostname) in node, so FQDN is probably a VM configurat... 
[16:00:04] godog moritzm: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160428T1600). Please do the needful. [16:03:53] Dereckson: how was https://gerrit.wikimedia.org/r/#/c/285932/4 tested? [16:04:11] I have stuff to add to puppet swat [16:04:58] Krenair: ok, let me know [16:05:20] https://gerrit.wikimedia.org/r/#/c/285659/ [16:05:43] 06Operations, 10ops-eqiad: Rack and Set up new application servers mw1261-1307 - https://phabricator.wikimedia.org/T133798#2248927 (10Southparkfan) [16:05:48] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2248930 (10BBlack) The initial 1-hour run is done, and there didn't seem to be any adverse effects. For the record, this is how the raw results format looks per-host: ``` total: 36039... [16:05:52] https://gerrit.wikimedia.org/r/#/c/283779/ [16:07:21] (03PS9) 10Ottomata: Add new confluent module and puppetization for using confluent Kafka [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) [16:07:50] godog: how do you usually test redirect rules? [16:08:01] 06Operations, 10OCG-General, 05codfw-rollout: Use FQDNs instead of hostnames in the download urls sent to Mediawiki - https://phabricator.wikimedia.org/T133864#2248935 (10mobrovac) >>! In T133864#2248916, @cscott wrote: > The code is just using [`os.hostname`](https://nodejs.org/api/os.html#os_os_hostname) i... [16:08:17] Dereckson: not sure how that gets tested [16:08:25] https://gerrit.wikimedia.org/r/#/c/285654/ [16:08:48] shall I put these on the calendar? [16:08:52] godog: here, the result looks coherent: https://phabricator.wikimedia.org/differential/diff/609/. Perhaps deploy it to mw1017 first, and test w.wiki/foo and w.wiki there? 
[16:09:01] Krenair: yeah I'm taking a look but please add to the calendar [16:09:12] ok [16:11:42] Dereckson: yeah should be testable in beta too I think, anyways it should get some consensus/checking first [16:11:53] on the code review itself, not on doing it vs not [16:12:10] ok [16:13:41] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2248953 (10BBlack) In the overall 1H test data, the net result is that when we make the switch: * ~33% of our client connections will upgrade from SPDY to H/2 (which is a very minor i... [16:14:15] (03PS4) 10Filippo Giunchedi: Followup I6b0bbb34: Fix pep8 in modules/diamond/files/collector/powerdns_recursor.py [puppet] - 10https://gerrit.wikimedia.org/r/285659 (owner: 10Alex Monk) [16:14:21] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Followup I6b0bbb34: Fix pep8 in modules/diamond/files/collector/powerdns_recursor.py [puppet] - 10https://gerrit.wikimedia.org/r/285659 (owner: 10Alex Monk) [16:14:44] phab issues back? [16:15:03] Krenair: your last change to Deployments is missing an { btw, lua error displayed [16:15:26] !log restarting elasticsearch server elastic1008.eqiad.wmnet (T110236) [16:15:27] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [16:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:15:36] godog, I saw, fixed [16:15:58] hehe nice [16:17:11] !log starting SPDY stats sample on 8x caches for 24H - T96848 [16:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:17:35] Krenair: I did eyeball https://gerrit.wikimedia.org/r/#/c/283779/1 though have you tested it for your usecase and a current alarm? [16:20:21] think I did that, yes [16:20:39] hey guys. if i'm wondering as to why /msg nickserv register .. won't work, am i in the right place to enquire? 
[16:20:53] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2248957 (10BBlack) (FTR: 24H sample started at 16:17 UTC, but stashbot didn't log it here) [16:21:40] refeez, no [16:21:46] ask in #freenode [16:21:57] cheers [16:22:12] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 359.57 seconds [16:22:46] phab issues are definitelly back [16:22:49] ^ [16:22:53] yeah :( [16:22:57] jynus: on it [16:23:05] bummer [16:23:09] I will downtime the alert [16:23:29] (03PS1) 10Papaul: DHCP: MAC address chang, swaped restbase2007 with restbase2009 Bug:T132976 [puppet] - 10https://gerrit.wikimedia.org/r/285984 (https://phabricator.wikimedia.org/T132976) [16:23:31] can that alert ping me as well? I get pinged for phab issues but not the related db issues [16:23:31] volans I will handle this [16:23:55] ping, as in irc, email? [16:24:26] jynus: we should probably give the upgrade of db1048 to 10 a bit more priority at this point [16:24:30] ok, thanks [16:24:56] 06Operations, 10ops-codfw, 06Analytics-Kanban, 06DC-Ops, and 5 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#2248960 (10elukey) [16:25:26] 06Operations, 10ops-codfw, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2248961 (10elukey) [16:25:46] 06Operations, 10ops-codfw, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2184249 (10elukey) [16:25:52] 06Operations, 10ops-codfw, 06Analytics-Kanban, 06DC-Ops, and 5 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1881702 (10elukey) [16:26:09] twentyafterfour, bookmark this for now: https://office.wikimedia.org/wiki/FY_2015-16_Performance_Review (some of those stats are self-explanatory "traffic" "connections" "com_select" [16:26:21] not that [16:26:30] !log further 
reduced the queue worker count on phabricator, to relieve stress on mysql m3 db1048 [16:26:30] this: https://tendril.wikimedia.org/host/view/db1043.eqiad.wmnet/3306 [16:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:27:23] (03PS1) 10Jforrester: Enable VisualEditor by default in SET mode on the Japanese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285985 [16:27:58] jynus: thanks [16:28:14] I will ping you personally when I see something strange [16:28:14] (03CR) 10Jforrester: [C: 04-1] "Planned for 9 May." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285985 (owner: 10Jforrester) [16:28:46] (03CR) 10BBlack: [C: 031] "Also note, great example of pplint arrow-alignment mess :P" [puppet] - 10https://gerrit.wikimedia.org/r/285976 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [16:29:37] Krenair: I'm double checking the graphite change btw [16:29:41] ok [16:34:00] (03PS2) 10Filippo Giunchedi: shinken: Allow undefined data in graphite for disk space checks [puppet] - 10https://gerrit.wikimedia.org/r/283779 (https://phabricator.wikimedia.org/T111540) (owner: 10Alex Monk) [16:34:08] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] shinken: Allow undefined data in graphite for disk space checks [puppet] - 10https://gerrit.wikimedia.org/r/283779 (https://phabricator.wikimedia.org/T111540) (owner: 10Alex Monk) [16:36:00] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2248985 (10faidon) >>! In T96848#2248953, @BBlack wrote: > I'd say this looks like a fine tradeoff to me. 2% net drop back to H/1, vs moving forward on standards and slightly-improvin... [16:37:44] I haven't seen a single db connection error in the log recently.... 
I can't reduce things much further though so hopefully it's sustainable at this level [16:38:05] (03PS4) 10Dereckson: Document FIXME statement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279142 [16:40:09] (03PS1) 10Papaul: DNS: asset tag entries change for resetbase2007 and restbase2009 bug: T132976 [dns] - 10https://gerrit.wikimedia.org/r/285988 [16:41:15] PROBLEM - puppet last run on mw2060 is CRITICAL: CRITICAL: Puppet has 1 failures [16:41:23] godog, looks good [16:41:33] (03PS1) 10Dereckson: Clean expired throttling definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285989 [16:41:34] Krenair: yup, checked on neon too [16:41:39] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2248995 (10faidon) Thanks for the update, @volans. Could you also briefly mention and/or update the task description with some numbers (how many slaves are TLS-enabled, how many are left) a... [16:45:31] (03CR) 10Dereckson: "Okay, I'll include this to this evening Puppet SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252627 (owner: 10Glaisher) [16:46:08] godog, any thoughts on the labtest one? [16:46:57] (03CR) 10Glaisher: "Not Puppet? :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252627 (owner: 10Glaisher) [16:47:14] (03CR) 10Dereckson: "regular SWAT, indeed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252627 (owner: 10Glaisher) [16:47:45] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 692 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5094257 keys - replication_delay is 692 [16:49:20] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2249010 (10jcrespo) Changing the cert would be blocked by doing another failover, as it would require another full restart and would stop replication from working (it cannot be done in a ho... 
[16:49:47] Krenair: looks fine, in the future don't forget to run the puppet compiler too [16:49:53] (03PS2) 10Dereckson: Revert "Increase abusefilter emergency disable threshold on MediaWiki.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252627 (owner: 10Glaisher) [16:50:02] err, rather, add a link to the puppet compiler in the review [16:50:14] (03CR) 10Dereckson: [C: 031] "PS2: rebased (short array syntax)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252627 (owner: 10Glaisher) [16:50:28] * Krenair nods [16:51:31] (03PS4) 10Filippo Giunchedi: Set up Let's Encrypt certificate for labtestwikitech [puppet] - 10https://gerrit.wikimedia.org/r/285654 (https://phabricator.wikimedia.org/T133167) (owner: 10Alex Monk) [16:51:37] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Set up Let's Encrypt certificate for labtestwikitech [puppet] - 10https://gerrit.wikimedia.org/r/285654 (https://phabricator.wikimedia.org/T133167) (owner: 10Alex Monk) [16:51:42] (03PS1) 10Papaul: adding install params for kafka200[1-2] Bug:T132976 [puppet] - 10https://gerrit.wikimedia.org/r/285994 (https://phabricator.wikimedia.org/T132976) [16:53:02] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2249016 (10jcrespo) > we should have a separate task for that BTW, that has not been done because I do not expect to be able to finish this task in less than 6 months (and such a task woul... 
[16:54:17] (03PS2) 10Papaul: adding install params for restbase200[7-9] Bug:T132976 [puppet] - 10https://gerrit.wikimedia.org/r/285994 (https://phabricator.wikimedia.org/T132976) [16:56:20] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2249023 (10Papaul) [16:57:03] !log restarting elasticsearch server elastic1009.eqiad.wmnet (T110236) [16:57:04] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [16:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:59:02] godog, hmm... it didn't like that [16:59:50] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2249043 (10Ottomata) Ok! We discussed partitioning today. We'd like the following: - / a small (30G?) RAID 1 partition on the first 2 drives. - 2 RAID 10 (probably ext4, asking to be su... [17:00:04] gwicke cscott arlolra subbu bearND mdholloway yurik: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160428T1700). Please do the needful. [17:00:13] deploying kartotherian & tilerator [17:00:30] no deploys [17:00:52] gehel, how are the servers doing btw? :) [17:01:11] which ones :-) ? [17:01:14] :-P [17:01:26] guess! [17:01:35] As far as I can tell, Freenode servers are running great! [17:01:53] * yurik throws a tomato at gehel [17:02:02] Krenair: yeah I'm taking a look, should be easy to fix [17:02:11] godog, I ran puppet again and it worked this time [17:02:15] after restarting apache [17:02:21] (03CR) 10Steinsplitter: [C: 04-1] "We have a lot of powerful filters (such as upload throttles), thus i see no need at all to disable this special configuration." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/285700 (https://phabricator.wikimedia.org/T132930) (owner: 10Bartosz Dziewoński) [17:02:32] Certificate is valid now too [17:02:35] thanks godog [17:02:48] Krenair: heh not 100% fixed in puppet yet but close enough :) [17:03:14] not 100% fixed? what's wrong? [17:03:39] (03CR) 10Steinsplitter: "*at this time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285700 (https://phabricator.wikimedia.org/T132930) (owner: 10Bartosz Dziewoński) [17:03:43] Krenair: two things, the include/rewritecond should be in the http vhost not https, and since this is apache2.4 it'll need 'require all granted' in /etc/acme/challenge-apache.conf which I've temporarily commented [17:03:47] yurik: Chris sent me updates about the elasticsearch servers, which are almost fully racked. No update on maps, so I assume they are not yet ready. And he is busy enough that I don't want to harrass him just yet... [17:03:48] s/commented/enabled/ [17:03:51] (03CR) 10Bartosz Dziewoński: "Hm, ten days ago on the linked task you said it'd be fine. The problem with implementing everything as a filter is that tools using the AP" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285700 (https://phabricator.wikimedia.org/T132930) (owner: 10Bartosz Dziewoński) [17:04:21] gehel, thx :) [17:04:28] * yurik throws gehel a cookie [17:05:34] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5082717 keys - replication_delay is 0 [17:07:32] (03Abandoned) 10Volans: [WIP] DB: Use generated CA for the TLS transition [puppet] - 10https://gerrit.wikimedia.org/r/278052 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [17:07:46] (03CR) 10Steinsplitter: "I said it'd be fine? The other api ratelimit stuff is live (deployed) on commons yet? Then i have no problem. 
:-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285700 (https://phabricator.wikimedia.org/T132930) (owner: 10Bartosz Dziewoński) [17:08:02] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2249057 (10faidon) >>! In T111654#2249010, @jcrespo wrote: > Changing the cert would be blocked by doing another failover, as it would require another full restart and would stop replicatio... [17:08:44] RECOVERY - puppet last run on mw2060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:11:34] (03CR) 10Bartosz Dziewoński: "*This* is the stuff :). This is an exact copy of the rules in filter 140. When it's deployed, that filter can be disabled." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285700 (https://phabricator.wikimedia.org/T132930) (owner: 10Bartosz Dziewoński) [17:12:35] (03PS3) 10Bartosz Dziewoński: Set $wgRateLimits['upload'] for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285700 (https://phabricator.wikimedia.org/T132930) [17:14:47] 06Operations, 10MediaWiki-Parser, 06Parsing-Team, 10Reading-Web-Backlog, and 6 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#1870420 (10dr0ptp4kt) @wrh2, okay if we close this task now? [17:16:17] godog, I saw some things get moved around by puppet when I ran it earlier, was that your work? 
[17:16:39] (03PS1) 10Filippo Giunchedi: openstack: move letsencrypt http challenge files to http vhost [puppet] - 10https://gerrit.wikimedia.org/r/286001 (https://phabricator.wikimedia.org/T133167) [17:16:41] Krenair: yeah, ^ [17:17:40] (03CR) 10Alex Monk: [C: 031] openstack: move letsencrypt http challenge files to http vhost [puppet] - 10https://gerrit.wikimedia.org/r/286001 (https://phabricator.wikimedia.org/T133167) (owner: 10Filippo Giunchedi) [17:17:58] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] openstack: move letsencrypt http challenge files to http vhost [puppet] - 10https://gerrit.wikimedia.org/r/286001 (https://phabricator.wikimedia.org/T133167) (owner: 10Filippo Giunchedi) [17:18:13] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2249100 (10jcrespo) > we should definitely be prepared to do certificate replacements "lightly" I will ask for a budget to several vendors to implement this functionality, (or should we pr... [17:18:37] (03CR) 10Steinsplitter: [C: 031] "Oh, Now i see. I overlooked that, sorry. Of course this is fine. I also disabled filter 140 now. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285700 (https://phabricator.wikimedia.org/T132930) (owner: 10Bartosz Dziewoński) [17:20:25] Krenair: I've updated the ticket but seems ok for now [17:20:41] thanks [17:21:19] (03CR) 10Bartosz Dziewoński: "Hmm, can you turn it back on for now? I'll have this patch deployed on Monday (during "morning SWAT"). There's only one deployment slot le" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285700 (https://phabricator.wikimedia.org/T132930) (owner: 10Bartosz Dziewoński) [17:22:43] godog: thanks for looking at that. I've been puzzled myself on how to do the challenge conf universally... so maybe I should make it have separate challenge.conf includes for 2.2 and 2.4, where only 2.4 has the graned? 
[17:22:47] *granted [17:25:41] godog: or maybe, there's a way to do a 2.2 vs 2.4 conditional in apache config directly? [17:26:02] <_joe_> bblack: yes [17:26:13] <_joe_> IfVersion [17:26:20] ah thanks! [17:26:29] <_joe_> e use it on mediawiki [17:26:37] <_joe_> about to be removed [17:27:28] (03CR) 10Muehlenhoff: [C: 032 V: 032] "Built and uploaded to carbon." [debs/varnish] (3.0.6-plus-wm) - 10https://gerrit.wikimedia.org/r/285641 (owner: 10Muehlenhoff) [17:28:14] (03PS1) 10BBlack: LE: apache 2.4 compat for challenge [puppet] - 10https://gerrit.wikimedia.org/r/286004 [17:32:10] nice bblack _joe_ [17:32:19] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, I'm assuming mod_version is enabled by default" [puppet] - 10https://gerrit.wikimedia.org/r/286004 (owner: 10BBlack) [17:33:56] 06Operations, 10Beta-Cluster-Infrastructure, 06Labs, 10Labs-Infrastructure, and 2 others: Clean up labs graphite datapoints - https://phabricator.wikimedia.org/T111540#2249119 (10Krenair) [17:39:06] (03PS2) 10BBlack: Varnish: don't redirect www.wikimediafoundation.org to m.* [puppet] - 10https://gerrit.wikimedia.org/r/285944 (owner: 10Dereckson) [17:39:48] (03CR) 10BBlack: [C: 032 V: 032] Varnish: don't redirect www.wikimediafoundation.org to m.* [puppet] - 10https://gerrit.wikimedia.org/r/285944 (owner: 10Dereckson) [17:39:50] 06Operations, 10MediaWiki-Parser, 06Parsing-Team, 10Reading-Web-Backlog, and 6 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2249139 (10Wrh2) I haven't seen the issue re-occur since the change went live, so I think it should be OK to close t... [17:40:19] !log deployed and restarted kartotherian & tilerator [17:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:40:36] Hi yurik. Tilerator sounds dinosaurian. [17:40:54] Dereckson, it is! 
it rips osm db into tiles [17:42:02] (03PS2) 10BBlack: LE: apache 2.4 compat for challenge [puppet] - 10https://gerrit.wikimedia.org/r/286004 [17:42:13] (03CR) 10BBlack: [C: 032 V: 032] LE: apache 2.4 compat for challenge [puppet] - 10https://gerrit.wikimedia.org/r/286004 (owner: 10BBlack) [17:43:16] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [17:44:26] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2249142 (10ori) >>! In T96848#2248985, @faidon wrote: >>>! In T96848#2248953, @BBlack wrote: >> I'd say this looks like a fine tradeoff to me. 2% net drop back to H/1, vs moving forwa... [17:44:27] starting db1038, ^that could flop a bit [17:46:16] already finished? cool! [17:46:34] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 696 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5088992 keys - replication_delay is 696 [17:47:41] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2249164 (10Ottomata) If it is easier to put the `/` partition RAID1 across the first 4 drives, that is fine too. [17:49:31] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2249183 (10Volans) The problem here is that MySQL/MariaDB support of SSL/TLS is still pretty simple: - all SSL related stuff are loaded at startup only, there is no dynamic way to reload a... 
[17:51:19] (03PS2) 10BBlack: update-ocsp: time validity check bugfix [puppet] - 10https://gerrit.wikimedia.org/r/285072 [17:51:26] (03CR) 10BBlack: [C: 032 V: 032] update-ocsp: time validity check bugfix [puppet] - 10https://gerrit.wikimedia.org/r/285072 (owner: 10BBlack) [17:52:19] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2249186 (10jcrespo) > Because the "easier" way that I see to achieve a "light" certificate replacement for MySQL is to contribute upstream implementing the feature. Actually, that was my p... [17:52:55] (03CR) 10Ottomata: [C: 032] Enable kafka200[12] to host Kafka and EventBus. [puppet] - 10https://gerrit.wikimedia.org/r/285958 (https://phabricator.wikimedia.org/T121558) (owner: 10Elukey) [17:53:15] 06Operations, 10MediaWiki-Parser, 06Parsing-Team, 10Reading-Web-Backlog, and 6 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2249193 (10Jdlrobson) 05stalled>03Resolved Thanks @Wrh2 for confirming. Please do reopen this bug if it reoccurs... 
[17:53:38] (03PS4) 10BBlack: Split mobile text cache for lazy loaded references testing [puppet] - 10https://gerrit.wikimedia.org/r/284576 (https://phabricator.wikimedia.org/T129693) (owner: 10Jdlrobson) [17:54:11] (03CR) 10BBlack: [C: 032 V: 032] Split mobile text cache for lazy loaded references testing [puppet] - 10https://gerrit.wikimedia.org/r/284576 (https://phabricator.wikimedia.org/T129693) (owner: 10Jdlrobson) [17:54:22] (03PS2) 10BBlack: Remove NetSpeedB cache splitting [puppet] - 10https://gerrit.wikimedia.org/r/284577 (owner: 10Jdlrobson) [17:55:43] (03CR) 10BBlack: [C: 032 V: 032] Remove NetSpeedB cache splitting [puppet] - 10https://gerrit.wikimedia.org/r/284577 (owner: 10Jdlrobson) [17:56:38] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2249206 (10RobH) So I've chatted with @ottomatta about this in IRC. Setting up this suggestion: |/|sda1, sddb1|radi1 |/var/lib/cassandra/a|sda2.sdb2, sdc1, sdd1|raid10 |/var/lib/cassandr... [17:58:28] paravoid: lvs1012...sfp+ swap appears to have corrected the issue [17:59:03] heh [17:59:46] 06Operations, 10ops-eqiad, 10Traffic: lvs1012 eth1 NIC Link flapping - https://phabricator.wikimedia.org/T133915#2249217 (10Cmjohnson) Swapped the spf+ asw=a5 xe-0/0/13....appears to have fixed the issue [17:59:46] shades of https://phabricator.wikimedia.org/T112781 + https://phabricator.wikimedia.org/T104458 ? [18:00:16] different switch, but once again suspecting SPF's on those new LVS boxes [18:02:51] godog, whoops, did you end up deploying https://gerrit.wikimedia.org/r/#/c/285765/? Indeed it is fine to deploy in production (I did for the prior one already), since it has no effect there. [18:04:06] 06Operations, 10ops-eqiad, 10Traffic: lvs1012 eth1 NIC Link flapping - https://phabricator.wikimedia.org/T133915#2249236 (10MoritzMuehlenhoff) 05Open>03Resolved This was caused by T133915, apt and puppet run working fine now. Closing. 
[18:04:10] 06Operations, 10Traffic, 06WMF-Legal, 10domains: wikipedia.lol - https://phabricator.wikimedia.org/T88861#2249238 (10ZhouZ) [18:06:29] 06Operations: lvs1012 - puppet fail, tcpdump package cannot be authenticated - https://phabricator.wikimedia.org/T133832#2249242 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff This was caused by T133915, apt and puppet run working fine now. [18:06:47] 06Operations, 10ops-eqiad, 10Traffic: lvs1012 eth1 NIC Link flapping - https://phabricator.wikimedia.org/T133915#2248747 (10MoritzMuehlenhoff) Wrong tab.. I meant to close T133832 [18:07:01] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 3 others: upgrade HHVM to 3.12.1 on terbium (Elastica: missing curl_init_pooled method.) - https://phabricator.wikimedia.org/T132751#2249251 (10EBernhardson) I'm unsure about changing everyone's cli scripts to hhvm. Mostly I'm not s... [18:07:50] (03PS1) 10Cmjohnson: Adding mgmt and production dns entries for labstore1004 and labstore1005 [dns] - 10https://gerrit.wikimedia.org/r/286010 [18:08:29] !log mattflaschen@tin Synchronized wmf-config/db-labs.php: Beta Cluster change (duration: 00m 37s) [18:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:11:25] jouncebot: next [18:11:25] In 0 hour(s) and 48 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160428T1900) [18:11:40] Hah, I thought the train was usually at 11am, not noon? [18:11:57] godog, I synced out the Labs change just in case. [18:12:43] matt_flaschen: thanks! yeah I fetched + ff but didn't sync-file since it wasn't affecting production [18:14:00] ostriches: Is it OK if I cherry-pick+deploy https://gerrit.wikimedia.org/r/286005 to wmf22 to prevent cross-wiki notifs from breaking for group2 people? 
[18:14:21] (Most users of this feature are on Wikipedias) [18:14:28] (without looking) [18:14:29] Yes [18:14:34] haha OK thanks [18:14:48] * ostriches goes back to bed in stuffy haze [18:15:06] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CRITICAL: Puppet has 1 failures [18:16:21] Maybe someone can update the interwiki cache? we got a new interwiki URL [18:19:15] !log restarting elasticsearch server elastic1010.eqiad.wmnet (T110236) [18:19:16] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [18:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:20:49] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 3 others: upgrade HHVM to 3.12.1 on terbium (Elastica: missing curl_init_pooled method.) - https://phabricator.wikimedia.org/T132751#2249310 (10EBernhardson) [18:40:24] RECOVERY - puppet last run on ms-be1016 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [18:41:10] jdlrobson: You want 285559 on what wmf branches? Are currently in use 1.27.0-wmf.21 and 1.27.0-wmf.22 [18:42:35] !log catrope@tin Synchronized php-1.27.0-wmf.22/extensions/Echo/: Fix fatal T133921 (duration: 00m 32s) [18:42:36] T133921: Exception on MediaWiki.org on mobile web - https://phabricator.wikimedia.org/T133921 [18:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:45:48] (03CR) 10Alex Monk: "This broken the script in labtest because that project does not exist. 
Instead of continuing to expand this list, you should review Icbb35" [puppet] - 10https://gerrit.wikimedia.org/r/280768 (owner: 10Yuvipanda) [18:46:37] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5093633 keys - replication_delay is 0 [18:47:39] (03PS8) 10Eevans: [WIP]: Cassandra 2.2.5 config [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) [18:54:08] jouncebot: [18:54:15] jouncebot: GIVE ME DA SCHEDULE BITTE [18:55:43] (03PS1) 10Hashar: all wikis to 1.27.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286014 [18:56:26] (03PS13) 10Alex Monk: labs dnsrecursor IP aliasing: work on all projects, not just some arbitrary ones [puppet] - 10https://gerrit.wikimedia.org/r/268921 [18:57:32] (03CR) 10Andrew Bogott: [C: 032] labs dnsrecursor IP aliasing: work on all projects, not just some arbitrary ones [puppet] - 10https://gerrit.wikimedia.org/r/268921 (owner: 10Alex Monk) [18:57:56] (03CR) 10Hashar: [C: 032] all wikis to 1.27.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286014 (owner: 10Hashar) [18:58:21] (03Merged) 10jenkins-bot: all wikis to 1.27.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286014 (owner: 10Hashar) [18:58:34] sleep(60) [18:59:16] we still have the usual spam of "Could not connect to server "{redis_server}"" ... :( [19:00:05] hashar ostriches: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160428T1900). 
[19:00:22] !log hashar@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.27.0-wmf.22 [19:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:03:36] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures [19:04:16] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures [19:10:05] 06Operations, 10ops-eqiad: Rack and Setup new elastic search - https://phabricator.wikimedia.org/T133772#2249475 (10Cmjohnson) [19:10:22] 06Operations, 10ops-eqiad: Rack and Setup new elastic search - https://phabricator.wikimedia.org/T133772#2242648 (10Cmjohnson) Racked in the following locations Row/Rack A3 elastic1032 elastic1033 elastic1034 elastic1035 Row/Rack B3 (Welcome search to row B) elastic1036 elastic1037 elastic1038 elastic1039 R... [19:10:42] !log 1.27.0-wmf.22 deployed. Uneventful. [19:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:10:49] so feel free to deploy. [19:10:52] ostriches: 1.27.0-wmf.22 done ;-} [19:12:26] 06Operations, 10ops-eqiad: Rack and Setup new elastic search - https://phabricator.wikimedia.org/T133772#2249498 (10Cmjohnson) Having an issue with installations. The hit TFTP okay and "install" but are in a continious pxe loop. I checked bios and they pxe is the last option. I tried to force a boot to HDD b... [19:15:32] !log manually rotating db1038's error log [19:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:17:44] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). 
[19:19:06] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:19:45] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [19:19:54] (03PS1) 10Andrew Bogott: Include python-keystoneclient along with the labsaliaser. [puppet] - 10https://gerrit.wikimedia.org/r/286017 [19:21:53] 06Operations, 10MediaWiki-Parser, 06Parsing-Team, 10Reading-Web-Backlog, and 5 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2249524 (10bmansurov) [19:23:58] (03CR) 10Brian Wolff: [C: 031] GoogleNewsSitemap configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285927 (https://phabricator.wikimedia.org/T39608) (owner: 10Dereckson) [19:32:55] PROBLEM - Host snapshot1007 is DOWN: PING CRITICAL - Packet loss = 100% [19:34:07] (03CR) 10Alex Monk: [C: 031] Include python-keystoneclient along with the labsaliaser. [puppet] - 10https://gerrit.wikimedia.org/r/286017 (owner: 10Andrew Bogott) [19:35:05] (03PS1) 10BBlack: Fix GeoIP cookie domain scope for beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/286019 (https://phabricator.wikimedia.org/T133936) [19:37:00] seeing this intermittently on phabricator page loads: [19:37:00] AphrontConnectionQueryException [19:37:01] Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #2003: Can't connect to MySQL server on 'm3-master.eqiad.wmnet' (99). [19:38:09] bblack that's due to all the importing [19:38:18] ok [19:38:21] We are importing open changes from gerrit [19:38:43] twentyafterfour tried to reduce the daemons as much as possible. [19:39:07] 500,000+ [19:40:26] there is like a ticket, cannot remember [19:40:55] what if I get the exception while trying to load the ticket about the exception? :P [19:41:07] phabception [19:41:25] ah, you do not read tickets in pure SQL?
[19:41:29] noob [19:41:56] just don't google 'google' and we'll be ok [19:42:12] :) [19:47:43] !log restarting elasticsearch server elastic1011.eqiad.wmnet (T110236) [19:47:43] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [19:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:48:58] (03PS1) 10Eevans: descriptors should be world readable [puppet] - 10https://gerrit.wikimedia.org/r/286020 [19:50:36] (03CR) 10Ottomata: [C: 031] "Cool, I think this should be fine! Let's def watch for blocked traffic after this is merged to make sure we didn't miss anything." [puppet] - 10https://gerrit.wikimedia.org/r/285904 (owner: 10Muehlenhoff) [19:55:12] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 3 others: upgrade HHVM to 3.12.1 on terbium (Elastica: missing curl_init_pooled method.) - https://phabricator.wikimedia.org/T132751#2249640 (10Gehel) As I see it there isn't an actual issue. We log warnings, but we still get the jo... [19:59:23] (03CR) 10Muehlenhoff: "Yep. I'll merge this tomorrow morning along with some iptables logging to catch any potential omissions." 
[puppet] - 10https://gerrit.wikimedia.org/r/285904 (owner: 10Muehlenhoff) [20:04:26] (03PS3) 10ArielGlenn: Raise connection limit for dumps server, for specific shared IP [puppet] - 10https://gerrit.wikimedia.org/r/285682 (https://phabricator.wikimedia.org/T133790) [20:08:37] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 3 others: "Elastica: missing curl_init_pooled method" due to mwscript job running with PHP 5 on terbium - https://phabricator.wikimedia.org/T132751#2249672 (10hashar) [20:09:09] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 3 others: "Elastica: missing curl_init_pooled method" due to mwscript job running with PHP 5 on terbium - https://phabricator.wikimedia.org/T132751#2209239 (10hashar) I have clarified the task description, it is no more about upgrad... [20:09:16] (03PS1) 10Alex Monk: deployment-prep shinken: Remove old HHVM queue size check [puppet] - 10https://gerrit.wikimedia.org/r/286023 [20:09:21] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2249676 (10RobH) Ok, old comment was wrong, had bad disk info. New suggestion: |mount|disks|raid level|size |/|sda1,sdb1, sdc1, sdd1 |raid10|50GB |/var/lib/cassandra/a|sda2.sdb2, sdc2, sd... [20:10:44] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2249677 (10Ottomata) +1, makes sense. Thank you! [20:18:00] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2249692 (10Cmjohnson) [20:19:11] (03CR) 10BryanDavis: [C: 031] "Thanks for looking at this so quickly bblack." 
[puppet] - 10https://gerrit.wikimedia.org/r/286019 (https://phabricator.wikimedia.org/T133936) (owner: 10BBlack) [20:20:00] 06Operations, 10Continuous-Integration-Config, 13Patch-For-Review: Switch CI from jsduck deb package to a gemfile/bundler system - https://phabricator.wikimedia.org/T109005#2249695 (10cscott) @Krinkle +1 Adding `bundle install jsduck` to node's `predoc` target is emphatically *not* the right way to do this. [20:20:15] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2249697 (10Cmjohnson) Racked one each in A4, C5, D4 [20:25:35] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 708 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5119352 keys - replication_delay is 708 [20:30:36] (03CR) 10Nuria: [C: 031] Return a custom HTTP 503 response for all the stat1001 websites due to maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/285976 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [20:44:09] oh, is phabricator broken? [20:44:31] Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #2003 [20:45:03] now it's back [20:45:14] or kind of still broken :/ [20:47:36] aude phabricator is currently importing over 500,000 changes. [20:48:07] twentyafterfour: Tried to reduce daemons as much as he could. [20:48:13] It is importing open gerrit changes.
[20:50:00] o_O [20:52:50] (03CR) 10BBlack: [C: 032] Fix GeoIP cookie domain scope for beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/286019 (https://phabricator.wikimedia.org/T133936) (owner: 10BBlack) [20:56:13] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5113423 keys - replication_delay is 0 [21:08:19] !log restarting elasticsearch server elastic1012.eqiad.wmnet (T110236) [21:08:20] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [21:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:09:50] (03PS2) 10Andrew Bogott: Include python-keystoneclient along with the labsaliaser. [puppet] - 10https://gerrit.wikimedia.org/r/286017 [21:12:28] aude: that error was happening a lot yesterday, but today seems to be less [21:12:41] (03CR) 10Andrew Bogott: [C: 032] Include python-keystoneclient along with the labsaliaser. [puppet] - 10https://gerrit.wikimedia.org/r/286017 (owner: 10Andrew Bogott) [21:13:10] wait, it's back... looks like I need to reopen that issue on phabricator describing it! oh wait... [21:14:29] 06Operations, 10Phabricator: Database errors using phabricator - https://phabricator.wikimedia.org/T133826#2249813 (10Smalyshev) Still happening from time to time today. [21:19:24] SMalyshev: Just saw your email about missing data in dumps... [21:19:41] It seems that every step is fighting us on this one... [21:19:44] gehel: yeah. there's a problem there somewhere :( [21:20:05] still don't know what's wrong, but dump of 04-11 is ok, so I'll be loading from it [21:20:37] gehel: what's the story with the disks for wdqs1001 - was it figured out? [21:21:26] SMalyshev: probably a mix up between disks when adding them to the server. We should have other disks available... [21:21:59] Chris is fairly busy racking all the new servers. Let me ping him to make sure he is aware of the issue... 
[21:22:15] gehel: ok, so I'll reload wdqs1002 today and will switch it on from maintenance, and will put 1001 into maintenance, because now it has bad data [21:22:26] gehel: and then we can reimage it [21:22:50] that story turns out to be a bigger can of worms than I expected... [21:23:09] Oh yeah! That should have been simple... [21:23:29] right. hopefully we're learning some lessons so next time will be easier [21:23:48] once we figure out what broke the dumps... [21:24:22] broken dumps, there might be a lesson. Wrong disks... not sure what can be done. Mistakes happen... [21:25:53] I'm getting intermittent database errors from phabricator -- "Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #2003: Can't connect to MySQL server on 'm3-master.eqiad.wmnet' (99)." [21:26:18] 4-5 in the last 5 minutes or so [21:32:00] the mysql errors for phab don't seem to be getting any better [21:32:09] bd808: there is a massive import of commits / Gerrit changes going on [21:32:19] that seems to overload the db connection pool [21:32:39] !log reduced phabricator taskmaster processes to 1 [21:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:32:45] magic [21:34:07] bd808: resolved? [21:34:28] twentyafterfour: strong maybe. haven't had an error since you did that [21:34:42] doh. just got another [21:38:16] I literally can't reduce it any further. 1 process shouldn't be able to overwhelm the database like that [21:41:30] !log added usleep(200000); to slow down the phabricator import even further. [21:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:42:41] twentyafterfour: is the import job doing different stuff from earlier today?
[21:42:50] I saw different patterns on tendril graphs [21:43:18] (03PS9) 10Eevans: [WIP]: Cassandra 2.2.5 config [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) [21:43:18] volans: it shouldn't be different [21:44:41] see https://tendril.wikimedia.org/host/view/db1043.eqiad.wmnet/3306 [21:45:02] and compare with the graphs on the left side that are last 7 days ;) [21:45:31] I've been lowering the concurrency all day [21:45:38] gradually, as people keep complaining [21:45:48] (03PS1) 10BBlack: note future anycast networks [dns] - 10https://gerrit.wikimedia.org/r/286066 (https://phabricator.wikimedia.org/T98006) [21:46:17] although I have to be honest I see aborted clients all the way back to 1 week, just 50% less than since the import started [21:46:46] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 and bast1001 for Dstrine - https://phabricator.wikimedia.org/T133953#2250003 (10DStrine) [21:47:29] (03PS1) 10Cscott: Use FQDN for OCG hostnames. [puppet] - 10https://gerrit.wikimedia.org/r/286068 (https://phabricator.wikimedia.org/T133864) [21:47:51] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 and bast1001 for Dstrine - https://phabricator.wikimedia.org/T133953#2250020 (10DStrine) I just signed L3 [21:48:19] (03PS10) 10Eevans: [WIP]: Cassandra 2.2.5 config [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) [21:48:24] 06Operations, 10OCG-General, 13Patch-For-Review, 05codfw-rollout: Use FQDNs instead of hostnames in the download urls sent to Mediawiki - https://phabricator.wikimedia.org/T133864#2250021 (10cscott) I think I figured it out (see above patch). Warning: completely untested! Someone who knows what they are... 
[21:48:28] 06Operations, 10Traffic, 10netops, 13Patch-For-Review: Anycast (Auth)DNS - https://phabricator.wikimedia.org/T98006#2250022 (10BBlack) Actually, amending the network thoughts above: should use 198.35.27.0/24 + 185.15.5**8**.0/24. Using 58 instead of 56 leaves us a contiguous /23 for more future flexibilit... [21:48:33] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 and bast1001 for Dstrine - https://phabricator.wikimedia.org/T133953#2250023 (10DStrine) [21:49:10] (03CR) 10Cscott: "I hope this is the right way to do it. Completely untested!" [puppet] - 10https://gerrit.wikimedia.org/r/286068 (https://phabricator.wikimedia.org/T133864) (owner: 10Cscott) [21:49:17] (03PS2) 10BBlack: note future anycast networks [dns] - 10https://gerrit.wikimedia.org/r/286066 (https://phabricator.wikimedia.org/T98006) [21:49:46] phabricator uses a massive number of connections, unfortunately, it easily exhausts the limit [21:51:26] yes [21:52:53] (03PS3) 10Dzahn: Restore garfieldairlines.net feed on fr.planet [puppet] - 10https://gerrit.wikimedia.org/r/285949 (owner: 10Dereckson) [21:56:00] (03PS1) 10Cscott: Decommission ocg1003. [puppet] - 10https://gerrit.wikimedia.org/r/286070 (https://phabricator.wikimedia.org/T84723) [21:56:34] (03CR) 10Dzahn: [C: 032] Restore garfieldairlines.net feed on fr.planet [puppet] - 10https://gerrit.wikimedia.org/r/285949 (owner: 10Dereckson) [21:56:59] PROBLEM - Host 208.80.154.20 is DOWN: PING CRITICAL - Packet loss = 100% [21:57:08] PROBLEM - Host labs-ns1.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:58:57] 06Operations, 10OCG-General, 06Services, 13Patch-For-Review: Implement flag to tell an OCG machine not to take new tasks from the redis task queue - https://phabricator.wikimedia.org/T120077#2250063 (10cscott) I attempted a puppet patch to use this functionality in https://gerrit.wikimedia.org/r/286070... 
[21:59:19] ^this is probably me and I'm looking at why it's failing [22:01:36] !log reboot of holmium [22:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:02:26] RECOVERY - Host 208.80.154.20 is UP: PING OK - Packet loss = 0%, RTA = 1.42 ms [22:03:16] RECOVERY - Host labs-ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 11.37 ms [22:07:46] (03PS2) 10Dzahn: DNS: asset tag entries change for resetbase2007 and restbase2009 bug: T132976 [dns] - 10https://gerrit.wikimedia.org/r/285988 (owner: 10Papaul) [22:07:51] (03CR) 10Dzahn: [C: 032] DNS: asset tag entries change for resetbase2007 and restbase2009 bug: T132976 [dns] - 10https://gerrit.wikimedia.org/r/285988 (owner: 10Papaul) [22:09:38] (03PS2) 10Dzahn: DHCP: MAC address chang, swaped restbase2007 with restbase2009 Bug:T132976 [puppet] - 10https://gerrit.wikimedia.org/r/285984 (https://phabricator.wikimedia.org/T132976) (owner: 10Papaul) [22:09:49] 06Operations, 06Discovery, 03Discovery-Search-Sprint, 07Elasticsearch, 13Patch-For-Review: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236#2250105 (10EBernhardson) [22:09:51] (03CR) 10Dzahn: [C: 032] DHCP: MAC address chang, swaped restbase2007 with restbase2009 Bug:T132976 [puppet] - 10https://gerrit.wikimedia.org/r/285984 (https://phabricator.wikimedia.org/T132976) (owner: 10Papaul) [22:19:56] PROBLEM - puppet last run on mw1189 is CRITICAL: CRITICAL: Puppet has 1 failures [22:37:55] PROBLEM - puppet last run on ms-fe2002 is CRITICAL: CRITICAL: Puppet has 1 failures [22:38:26] PROBLEM - puppet last run on mw2156 is CRITICAL: CRITICAL: puppet fail [22:39:55] 07Blocked-on-Operations, 06Operations, 06Increasing-content-coverage, 06Research-and-Data-Backlog: Backport python3-sklearn and python3-sklearn-lib from sid - https://phabricator.wikimedia.org/T133362#2250197 (10ggellerman) [22:46:35] RECOVERY - puppet last run on mw1189 is OK: OK: Puppet is currently enabled, 
last run 1 minute ago with 0 failures [22:55:05] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:57:40] MatmaRex: around? [22:58:37] yeah [22:59:27] I've CR+2 your change, so Zuul can start Jenkins jobs and it will be ready for SWAT start. [23:00:04] RoanKattouw ostriches Krenair MaxSem awight Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160428T2300). [23:00:05] Dereckson jdlrobson MatmaRex: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:20] * James_F just threw in an extra patch, sorry. [23:00:40] I'll do it [23:00:44] * James_F makes a core pull-through for it. [23:00:55] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [23:01:52] MaxSem: could you start by MatmaRex one? I offered him together to start with 286016 as it's late in his timezone. [23:02:09] MaxSem: yeah, i was just going to ask ^ :) [23:02:27] and it's already merged! [23:02:56] \o [23:03:23] I moved my things to next week because I saw this was full [23:03:38] actually, one thing [23:04:00] Krenair: if there is any emergency, I cam report the no-op throttle [23:04:45] RECOVERY - puppet last run on ms-fe2002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [23:05:41] don't we usually just purge expired throttle exemptions when we next go to edit the file? instead of extra commits for that [23:06:02] grrrrr [23:06:04] rrrr [23:06:12] one host is hanging [23:06:16] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
[23:06:43] * MatmaRex hugs MaxSem [23:07:14] RECOVERY - puppet last run on mw2156 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:07:15] !log maxsem@tin Synchronized php-1.27.0-wmf.22/extensions/UploadWizard/: https://gerrit.wikimedia.org/r/#q,286016,n,z (duration: 02m 34s) [23:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:08:04] MatmaRex, ^ [23:08:07] I couldn [23:08:21] 't care about snapshot hosts [23:08:38] MaxSem: thanks, looks fixed [23:09:52] MaxSem: https://gerrit.wikimedia.org/r/#/c/286092/ is the VE one. [23:10:21] MaxSem: And https://gerrit.wikimedia.org/r/#/c/286093/ fixes .gitreview to actually point to the branch and not master. Ahem. :-) [23:10:39] (good night folks) [23:10:41] James_F, don't see it on the page [23:11:27] MaxSem: https://gerrit.wikimedia.org/r/#/c/285769/ is the one in VE; because of the broken state of how that works I have to do manual pull-throughs to MW deployment branches, which is https://gerrit.wikimedia.org/r/#/c/286092/ [23:11:58] !log maxsem@tin Synchronized php-1.27.0-wmf.22/extensions/VisualEditor/: https://gerrit.wikimedia.org/r/#q,285769,n,z (duration: 02m 34s) [23:12:00] good night MatmaRex [23:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:12:29] do we have a bug in branching scripts? [23:13:15] MaxSem: You mean, why did I have to fix .gitreview? Yeah; see also b46a5df which was a different follow-up to 6f03a10. [23:13:51] jdlrobson, yt? 
[23:14:00] yup [23:16:53] 1Log OS installation on restbase200[7-9] [23:17:23] !log maxsem@tin Synchronized php-1.27.0-wmf.22/extensions/WikidataPageBanner/: https://gerrit.wikimedia.org/r/286018 (duration: 02m 29s) [23:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:17:40] jdlrobson, ^ [23:17:51] MaxSem: testing [23:19:32] MaxSem: perfect [23:21:45] MaxSem: For the VE one, you just need to sync extensions/VisualEditor/lib/ve/src/ce/ve.ce.BranchNode.js [23:22:15] bzzt! slash and dot overflow! [23:22:20] Tsk. [23:24:34] "Social_enterprise is not a valid UUID" ... is anyone ever going to fix that kind of exception spam some day? [23:26:07] !log maxsem@tin Synchronized php-1.27.0-wmf.22/extensions/VisualEditor/: (no message) (duration: 02m 30s) [23:26:13] James_F, ^ [23:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:26:20] are we still doing swat? [23:26:26] aude: yup [23:26:26] Yes. [23:26:34] aude: MaxSem is swatting [23:26:44] can i sneak a patch in (or otherwise do it myself after) https://gerrit.wikimedia.org/r/#/c/286087/ [23:26:50] MaxSem: Yup, fixed. [23:27:19] aude, wmf21 is not live anymore [23:27:20] Krenair: yes, https://phabricator.wikimedia.org/rOMWC9c420567456c8bc0c2b996bea4a0d9faa2d4305c [23:27:30] it's wmf21 wikidata [23:27:36] + wmf22 core [23:27:37] MaxSem: is the config change happening later or did you miss that? [23:27:48] so i don't know if the automagical submodule updating will work :/ [23:27:53] now let's do config [23:28:06] anyone has any preferences on the order? 
[23:28:18] i'm available for mine whenever :) [23:30:58] (03CR) 10MaxSem: [C: 032] Document FIXME statement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279142 (owner: 10Dereckson) [23:31:34] this one is no op [23:31:48] (03CR) 10MaxSem: [C: 032] Clean expired throttling definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285989 (owner: 10Dereckson) [23:33:07] (03CR) 10MaxSem: [C: 032] Allow wmf-config/throttle.php to be lenient on ip/IP typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280865 (https://phabricator.wikimedia.org/T131469) (owner: 10Dereckson) [23:34:56] (03PS5) 10MaxSem: Document FIXME statement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279142 (owner: 10Dereckson) [23:35:21] fuck Project policy requires all submissions to be a fast-forward. [23:35:22] Please rebase the change locally and upload again for review. [23:35:56] :p we went through the same change with ops/puppet [23:36:07] and the point is? [23:37:12] it prevents a possible fuckup. 
before if unlucky about the timing it could get in an inconsistent state [23:37:55] don't remember a single incident about it in mediawiki-config [23:38:07] it happened once with ops/puppet, then it got changed [23:39:12] config repo is different [23:39:39] it's mostly lines of settings [23:40:18] one advantage: it could speed up git revert if something is broken [23:40:32] to revert a linear history is faster than revert a merged change [23:41:55] and I'm still waiting for a single patchset to merge and only then can start with another one [23:42:10] screw it, I'm not doing SWATs any more [23:42:22] (after I finish this one) [23:43:11] :( [23:44:24] (03PS3) 10MaxSem: Allow wmf-config/throttle.php to be lenient on ip/IP typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280865 (https://phabricator.wikimedia.org/T131469) (owner: 10Dereckson) [23:44:51] (03CR) 10MaxSem: Allow wmf-config/throttle.php to be lenient on ip/IP typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280865 (https://phabricator.wikimedia.org/T131469) (owner: 10Dereckson) [23:45:00] (03CR) 10MaxSem: [C: 032] Allow wmf-config/throttle.php to be lenient on ip/IP typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280865 (https://phabricator.wikimedia.org/T131469) (owner: 10Dereckson) [23:45:13] oh sorry didn't see your 23:35:22 < MaxSem> Please rebase the change locally and upload again for review.
[23:45:35] (03Merged) 10jenkins-bot: Allow wmf-config/throttle.php to be lenient on ip/IP typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280865 (https://phabricator.wikimedia.org/T131469) (owner: 10Dereckson) [23:45:44] Generally, use Rebase button on Gerrit is enough [23:46:17] (03PS2) 10MaxSem: Clean expired throttling definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285989 (owner: 10Dereckson) [23:46:24] PROBLEM - HHVM rendering on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:46:25] PROBLEM - Apache HTTP on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:48:13] (03CR) 10MaxSem: [C: 032] Clean expired throttling definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285989 (owner: 10Dereckson) [23:48:34] PROBLEM - configured eth on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:48:46] PROBLEM - dhclient process on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:48:54] (03PS2) 10Dzahn: admin: remove access for tnegrin pt1 [puppet] - 10https://gerrit.wikimedia.org/r/285898 (https://phabricator.wikimedia.org/T90932) [23:48:54] PROBLEM - SSH on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:48:55] PROBLEM - Check size of conntrack table on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:49:15] PROBLEM - DPKG on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:49:30] (03PS3) 10MaxSem: Revert "Increase abusefilter emergency disable threshold on MediaWiki.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252627 (owner: 10Glaisher) [23:49:36] PROBLEM - puppet last run on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:49:39] (03CR) 10MaxSem: [C: 032] Revert "Increase abusefilter emergency disable threshold on MediaWiki.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252627 (owner: 10Glaisher) [23:49:55] PROBLEM - RAID on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:50:04] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review, 05Security: define in Puppet or remove user account - tnegrin - https://phabricator.wikimedia.org/T90932#2250300 (10Dzahn) @tnegrin Just to make sure before i merge this, did you mean you don't need any shell access on WMF servers anymore or just not... [23:50:12] (03Merged) 10jenkins-bot: Revert "Increase abusefilter emergency disable threshold on MediaWiki.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252627 (owner: 10Glaisher) [23:50:15] PROBLEM - salt-minion processes on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:51:01] (03PS2) 10MaxSem: GoogleNewsSitemap configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285927 (https://phabricator.wikimedia.org/T39608) (owner: 10Dereckson) [23:51:10] (03CR) 10MaxSem: [C: 032] GoogleNewsSitemap configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285927 (https://phabricator.wikimedia.org/T39608) (owner: 10Dereckson) [23:51:36] (03Merged) 10jenkins-bot: GoogleNewsSitemap configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285927 (https://phabricator.wikimedia.org/T39608) (owner: 10Dereckson) [23:52:16] RECOVERY - configured eth on mw1142 is OK: OK - interfaces up [23:52:35] RECOVERY - dhclient process on mw1142 is OK: PROCS OK: 0 processes with command name dhclient [23:52:56] RECOVERY - DPKG on mw1142 is OK: All packages OK [23:53:08] who's deploying? 
[23:53:25] I was in the process of syncing [23:53:25] RECOVERY - puppet last run on mw1142 is OK: OK: Puppet is currently enabled, last run 37 minutes ago with 0 failures [23:53:33] but it's stuck, probably on mw1142 [23:53:45] RECOVERY - RAID on mw1142 is OK: OK: no RAID installed [23:53:50] MaxSem: sorry, all yours [23:53:55] RECOVERY - HHVM rendering on mw1142 is OK: HTTP OK: HTTP/1.1 200 OK - 67152 bytes in 0.407 second response time [23:53:56] RECOVERY - salt-minion processes on mw1142 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:54:05] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.043 second response time [23:54:06] I don't mind others deploying in my window, but please ask first :) [23:54:15] PROBLEM - HHVM rendering on mw1095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:54:34] RECOVERY - SSH on mw1142 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [23:54:34] RECOVERY - Check size of conntrack table on mw1142 is OK: OK: nf_conntrack is 11 % full [23:54:42] MaxSem: yes, sorry. I assumed swat was over without checking.
[23:55:04] PROBLEM - Apache HTTP on mw1095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:55:55] RECOVERY - HHVM rendering on mw1095 is OK: HTTP OK: HTTP/1.1 200 OK - 67152 bytes in 0.181 second response time [23:56:43] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 02m 25s) [23:56:46] RECOVERY - Apache HTTP on mw1095 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.043 second response time [23:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:57:10] PROBLEM - mysqld processes on holmium is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:58:39] (03PS2) 10MaxSem: Enable lazy loaded references in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285553 (https://phabricator.wikimedia.org/T129693) (owner: 10Jdlrobson) [23:59:13] (03CR) 10MaxSem: [C: 032] Enable lazy loaded references in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285553 (https://phabricator.wikimedia.org/T129693) (owner: 10Jdlrobson) [23:59:15] PROBLEM - puppet last run on mw1142 is CRITICAL: CRITICAL: Puppet has 33 failures [23:59:16] !log maxsem@tin Synchronized wmf-config/throttle.php: (no message) (duration: 02m 24s) [23:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:59:49] (03Merged) 10jenkins-bot: Enable lazy loaded references in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285553 (https://phabricator.wikimedia.org/T129693) (owner: 10Jdlrobson) [23:59:58] MaxSem: (no message)?