[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161116T0000).
[00:00:05] ebernhardson and MarcoAurelio: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:17] I'm here
[00:01:00] I can SWAT.
[00:02:32] \o
[00:03:24] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:05:42] (CR) MZMcBride: Allow a wiki to use __NOINDEX__ and __INDEX__ in all namespaces (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/321712 (owner: Dereckson)
[00:07:06] MarcoAurelio: you know how to use for?
[00:07:15] (the loop PHP construct)
[00:07:23] ?
[00:07:41] Okay, we'll do that later, I'll explain that to you.
[00:08:01] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/321797 (https://phabricator.wikimedia.org/T150807) (owner: MarcoAurelio)
[00:08:12] I copied the above code which was used for abusefilter
[00:08:14] basically it's a way not to copy/paste the same block 3 or more times, but to define an array and loop over that array
[00:08:36] (Merged) jenkins-bot: Allow 'interface-editor' & 'engineer' users to use OATHAuth [mediawiki-config] - https://gerrit.wikimedia.org/r/321797 (https://phabricator.wikimedia.org/T150807) (owner: MarcoAurelio)
[00:08:55] Yes, your config is correct.
[00:09:03] Still too much to learn
[00:09:58] maybe someone could make that thing and we could add those fancy interface-editor-like groups from elsewhere too
[00:10:54] PROBLEM - puppet last run on labstore1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:12:42] is it live on mw1099?
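A minimal sketch of the "define an array and loop over it" idea from the 00:08:14 message, written in Python rather than the PHP used in wmf-config/CommonSettings.php. The function name `grant_right` and the dictionary layout are assumptions for illustration only, not the deployed config; the group names and the `oathauth-enable` right come from the discussion.

```python
# Hypothetical illustration: grant one right to several groups with a loop
# instead of copy/pasting the same block once per group.

def grant_right(permissions, groups, right):
    """Set `right` to True for every group in `groups`; returns `permissions`."""
    for group in groups:
        # setdefault creates the per-group dict the first time a group is seen
        permissions.setdefault(group, {})[right] = True
    return permissions


# Assumed starting state: sysops already have the right.
group_permissions = {"sysop": {"oathauth-enable": True}}

# One loop covers both new groups; adding a third is a one-word change.
grant_right(group_permissions, ["interface-editor", "engineer"], "oathauth-enable")
print(sorted(group_permissions))
```

Adding further groups later (as suggested at 00:09:58) would then only mean extending the list passed to the loop.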
[00:12:50] MarcoAurelio: Soon you will be a full on programmer :)
[00:13:14] bawolff: I'll win the lottery first
[00:14:03] MarcoAurelio: Well you know if statements, and you're about to learn loops. I think branches + jumps = Turing complete
[00:15:06] bawolff: well, not really. I just saw the 'if (blah)' and copied it. I understood what that thing was doing, but if you asked me I'd not be able to do that from the start
[00:15:33] yet :P
[00:15:53] MarcoAurelio: live on mw1099
[00:15:57] oh, ruwiki closers do have 'delete'
[00:16:06] we need those added too...
[00:16:10] * MarcoAurelio sighs
[00:16:24] and ptwiki eliminators, are they added?
[00:16:29] sigh^2
[00:16:35] checking
[00:16:39] (PS1) Mattflaschen: Add dewiktionary to RESTBase on Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/321817 (https://phabricator.wikimedia.org/T150764)
[00:16:43] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/321724 (https://phabricator.wikimedia.org/T149740) (owner: EBernhardson)
[00:17:35] Dereckson: checked ruwiki for engineer and they have oathauth-enable on mw1099, will check a random interface-editor wiki
[00:17:37] MarcoAurelio, I don't think we were counting 'delete' as a right that needs to have 2FA protection?
[00:17:57] idk, they deployed it to sysops
[00:18:07] sysops have editinterface
[00:18:14] far more dangerous
[00:18:18] I care much more about editinterface than delete
[00:18:25] (PS2) Dereckson: Increase CirrusSearch interwiki load test to 25% [mediawiki-config] - https://gerrit.wikimedia.org/r/321724 (https://phabricator.wikimedia.org/T149740) (owner: EBernhardson)
[00:18:37] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/321724 (https://phabricator.wikimedia.org/T149740) (owner: EBernhardson)
[00:18:40] Original rationale though was that admins have lots of privs, so likely to be targeted
[00:18:57] well, the patch for editinterface is in the process of deployment :)
[00:19:17] MarcoAurelio: Krenair: it's live on mw1099 by the way
[00:19:21] but are also a smallish group because 2FA hasn't been super well tested yet, and procedures for people losing phones is still a bit up in the air
[00:19:23] (Merged) jenkins-bot: Increase CirrusSearch interwiki load test to 25% [mediawiki-config] - https://gerrit.wikimedia.org/r/321724 (https://phabricator.wikimedia.org/T149740) (owner: EBernhardson)
[00:19:30] ebernhardson: live on mw1099 too
[00:19:38] checking
[00:20:21] Dereckson: looks great
[00:20:58] Dereckson: checked elwiktionary for interface-editor as well, and oathauth-enable is there on mw1099, I think we can roll this everywhere
[00:22:27] * MarcoAurelio renames elwiktionary editinterface to apergo 's group :P
[00:23:21] :-P
[00:23:26] I can't believe I'm still in here
[00:23:31] * apergos goes for real
[00:25:05] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Allow interface-editor & engineer users to use OATHAuth (T150807) (duration: 00m 59s)
[00:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:28] T150807: Add 'interface-editors' to the list of users who can enable OATHAuth on WMF wikis - https://phabricator.wikimedia.org/T150807
[00:26:08] !log dereckson@tin Synchronized wmf-config/CirrusSearch-production.php: Increase CirrusSearch interwiki load test to 25% (T149740) (duration: 00m 58s)
[00:26:09] yet <-- heh, maybe; I'm always open to learn new things
[00:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:27] T149740: Run load tests of cross-project searching to verify its stability - https://phabricator.wikimedia.org/T149740
[00:31:24] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[00:32:00] Dereckson: unless I'm required for further testing, I'm off to bed
[00:39:54] RECOVERY - puppet last run on labstore1005 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[00:42:02] (CR) Filippo Giunchedi: [C: -1] Enable multiple config files in phabricator (7 comments) [puppet] - https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055) (owner: 20after4)
[01:07:37] (PS1) Yuvipanda: Revert "Route all logs to /dev/null" [software/tools-webservice] - https://gerrit.wikimedia.org/r/321828
[01:19:42] (CR) Yuvipanda: [C: 2 V: 2] Revert "Route all logs to /dev/null" [software/tools-webservice] - https://gerrit.wikimedia.org/r/321828 (owner: Yuvipanda)
[01:20:43] Operations, Commons, MediaWiki-File-management, Multimedia, Upstream: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#2797820 (Jdforrester-WMF) OK, upstream ha...
[01:21:52] Operations, Commons, Multimedia: Deploy some fixed version of ImageMagick from apt.wikimedia.org - https://phabricator.wikimedia.org/T150432#2797821 (Dereckson) >>! In T141739#2797820, @Jdforrester-WMF wrote: > OK, upstream have released ImageMagick 6.9.6-5 with a fix for the issue. https://github.co...
[01:34:54] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:35:54] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[01:37:47] !log restbase201[0-2] - signing puppet certs, salt-key, initial run
[01:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:53:48] Operations, ops-codfw, Patch-For-Review: rack/setup restbase201[0-2] - https://phabricator.wikimedia.org/T150680#2797867 (Papaul)
[01:54:33] Operations, ops-codfw, Patch-For-Review: rack/setup restbase201[0-2] - https://phabricator.wikimedia.org/T150680#2793238 (Papaul) a:Papaul>fgiunchedi @fgiunchedi you can take over. Thanks
[02:04:48] PROBLEM - puppet last run on db1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:09:08] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:20:10] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.2) (duration: 06m 08s)
[02:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:22:07] (PS1) Dzahn: remove wmf3762.mgmt.frack.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/321832 (https://phabricator.wikimedia.org/T149875)
[02:23:11] (PS2) Dzahn: remove wmf3762.mgmt.frack.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/321832 (https://phabricator.wikimedia.org/T149875)
[02:23:32] (CR) Dzahn: [C: 2] remove wmf3762.mgmt.frack.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/321832 (https://phabricator.wikimedia.org/T149875) (owner: Dzahn)
[02:25:28] Operations, DNS, Traffic, Patch-For-Review: mgmt hosts that exist but don't resolve to an IP - https://phabricator.wikimedia.org/T149875#2797909 (Dzahn)
[02:28:09] Operations, DNS, Traffic, Patch-For-Review: mgmt hosts that exist but don't resolve to an IP - https://phabricator.wikimedia.org/T149875#2797911 (Dzahn) Open>Resolved all done. ran getmgmtips rejects.txt stays empty.
[02:30:33] (CR) Dzahn: "what the .. why did this add it on kubernetes-worker instead of krypton" [puppet] - https://gerrit.wikimedia.org/r/316041 (owner: Dzahn)
[02:32:32] (CR) Dzahn: "wow, compare PS 1, 2 and 3, the first 2 do this on node "krypton" as intended, and then on PS3 it changes to "kubernetes-worker" and all t" [puppet] - https://gerrit.wikimedia.org/r/316041 (owner: Dzahn)
[02:33:48] RECOVERY - puppet last run on db1038 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[02:37:13] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[02:38:14] (PS2) Jforrester: Make notification logos high-density [mediawiki-config] - https://gerrit.wikimedia.org/r/319968 (https://phabricator.wikimedia.org/T147219) (owner: Catrope)
[02:38:17] (PS2) Jforrester: Fix notification icon path for foundationwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/319967 (owner: Catrope)
[02:38:20] (PS6) Jforrester: Update instances of Wikimedia Foundation logo #1 [mediawiki-config] - https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) (owner: Urbanecm)
[02:39:22] (CR) Jforrester: "PS2: Rebase." [mediawiki-config] - https://gerrit.wikimedia.org/r/319967 (owner: Catrope)
[02:39:32] (CR) Jforrester: "PS2: Rebase." [mediawiki-config] - https://gerrit.wikimedia.org/r/319968 (https://phabricator.wikimedia.org/T147219) (owner: Catrope)
[02:41:12] (CR) Jforrester: [C: -2] "PS6: Rebased; re-did the logo exports from scratch, made changes to the notification icon (in both sizes), removed changes to loginwiki." [mediawiki-config] - https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) (owner: Urbanecm)
[02:43:12] (PS1) Dzahn: Revert "add mapped IPv6 address for krypton" [puppet] - https://gerrit.wikimedia.org/r/321833
[02:44:58] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:45:58] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[02:46:40] (PS2) Dzahn: Revert "add mapped IPv6 address for krypton" [puppet] - https://gerrit.wikimedia.org/r/321833
[02:46:48] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.3) (duration: 10m 41s)
[02:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:47:39] (CR) Dzahn: [C: 2] Revert "add mapped IPv6 address for krypton" [puppet] - https://gerrit.wikimedia.org/r/321833 (owner: Dzahn)
[02:52:16] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Nov 16 02:52:15 UTC 2016 (duration 5m 27s)
[02:52:29] omg, some changed the content model of en wikipedia's main page
[02:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:54:17] Operations, ops-eqiad, Continuous-Integration-Infrastructure (phase-out-gallium): decom gallium (data center) - https://phabricator.wikimedia.org/T150316#2797918 (Dzahn) a:Dzahn>None
[02:55:01] Operations, ops-eqiad, Continuous-Integration-Infrastructure (phase-out-gallium): decom gallium (data center) - https://phabricator.wikimedia.org/T150316#2781777 (Dzahn)
[02:55:19] Operations, ops-eqiad, Continuous-Integration-Infrastructure (phase-out-gallium): decom gallium (data center) - https://phabricator.wikimedia.org/T150316#2781777 (Dzahn) p:High>Normal
[02:56:18] (PS7) 20after4: Enable multiple config files in phabricator [puppet] - https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055)
[02:59:21] (CR) 20after4: "Addressed filippo's feedback" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055) (owner: 20after4)
[03:05:21] (PS8) 20after4: Enable multiple config files in phabricator [puppet] - https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055)
[03:06:44] (CR) jenkins-bot: [V: -1] Enable multiple config files in phabricator [puppet] - https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055) (owner: 20after4)
[03:08:46] (PS9) 20after4: Enable multiple config files in phabricator [puppet] - https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055)
[03:23:08] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 605.68 seconds
[03:25:08] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 640.81 seconds
[03:37:08] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 273.62 seconds
[04:34:58] (PS10) 20after4: Enable multiple config files in phabricator [puppet] - https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055)
[05:41:58] PROBLEM - puppet last run on elastic1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:02:48] PROBLEM - puppet last run on puppetmaster2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:06:48] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 45 probes of 409 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map
[06:09:58] RECOVERY - puppet last run on elastic1036 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:11:48] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 2 probes of 409 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map
[06:30:48] RECOVERY - puppet last run on puppetmaster2001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:49:15] Operations, Mobile-Content-Service, Wikipedia-Android-App-Backlog, Mobile: Some users can't login or edit without proxy in Iran - https://phabricator.wikimedia.org/T142309#2798094 (Niedzielski) @Darafsh, are you on Wi-Fi, mobile cellular network, or mobile network with Wikipedia Zero (there shoul...
[06:54:25] Operations, Mobile-Content-Service, Wikipedia-Android-App-Backlog, Mobile: Some users can't login or edit without proxy in Iran - https://phabricator.wikimedia.org/T142309#2798096 (Ladsgroup) Iran is not participating in Wikipedia Zero. One thing is, It seems the issue got resolved in the past co...
[07:09:44] Operations, MediaWiki-General-or-Unknown, Traffic: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2798106 (Marostegui) Open>stalled p:Unbreak!>High
[07:12:18] (CR) 20after4: [C: -1] "hold off on this until the related changes are deployed." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055) (owner: 20after4)
[07:17:25] (CR) Marostegui: "No problem :)" [puppet] - https://gerrit.wikimedia.org/r/320752 (https://phabricator.wikimedia.org/T149829) (owner: Jcrespo)
[07:18:55] (CR) Marostegui: "You mean a directory that would be like: mariadb/backups/ and contain backup.pp, otrsbackups.pp etc?" [puppet] - https://gerrit.wikimedia.org/r/320989 (owner: Marostegui)
[07:30:37] !log Stopping replication in db2066 for maintenance - T150518
[07:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:59] T150518: Import S5 to dbstore2001 and dbstore2002 + compression - https://phabricator.wikimedia.org/T150518
[07:31:42] Operations, OCG-General, Wiktionary, Patch-For-Review: Download as PDF does not work in English Wiktionary: "There was an error while attempting to render your book." - https://phabricator.wikimedia.org/T150604#2798126 (Marostegui) I see the above change was merged, is it going to be deployed today?
[07:50:40] marostegui: Do you know anything about how to get tables replicated to Labs?
[07:50:48] from production
[07:51:14] kaldari: what do you mean or need?
[07:52:16] I need to get the page_assessments and page_assessments_projects tables on enwiki replicated to Labs. They don't contain any private info.
[07:53:04] kaldari: Do you mind creating a ticket for that so we can track it? I will ping jynus about it, so he can educate me on how to handle that
[07:53:30] Sure
[07:53:33] chances are it's already being replicated and you just need to stick it in the maintain-views config, then run maintain-views for each database the table exists in
[07:54:25] Krenair: You are right, I can see the tables in labsdb1003 for instance
[07:54:28] Krenair: Where is the maintain-views config? Doesn't seem to be in wmf-config
[07:54:43] that stuff is in puppet rather than the normal MW config
[07:55:10] this is mysql infrastructure stuff rather than something MW controls
[07:55:14] Krenair: Is that something I can do myself or better to have a DBA do it?
[07:55:23] you can upload the patch to the puppet repo
[07:55:33] correction, this is labs infrastructure
[07:55:42] that's not a correction
[08:01:41] Krenair: I don't see anything in the operations/puppet repo called maintain-views. Any pointers?
[08:02:38] PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:02:39] (CR) Jcrespo: "I was thinking more of /backups/init.pp and /backups/otrs.pp, but I have not checked the original identifiers." [puppet] - https://gerrit.wikimedia.org/r/320989 (owner: Marostegui)
[08:02:52] kaldari, modules/role/templates/labsdb/maintain-views.yaml
[08:03:00] script is at modules/role/files/labsdb/maintain-views.py
[08:03:07] Operations, Mobile-Content-Service, Wikipedia-Android-App-Backlog, Mobile: Some users can't login or edit without proxy in Iran - https://phabricator.wikimedia.org/T142309#2798137 (Niedzielski) Open>Resolved a:Niedzielski @Ladsgroup, thanks for your diligence. I'm sorry it took so lon...
[08:03:26] it got renamed from 'maintain-replicas' at some stage so you may find some old references to that
[08:04:06] Krenair: Ah, looks like my puppet repo is way out of date (which is why I didn't find it)
[08:04:23] yeah it was imported from a different repo relatively recently
[08:04:33] I'm going AFK for a while
[08:04:41] Thanks!
[08:04:46] kaldari, there is a tracking ticket for those tasks https://phabricator.wikimedia.org/T150767
[08:05:03] if you put a ticket in, ...
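A rough sketch of the workflow described at 07:53:33 (list the table in the maintain-views config, then run maintain-views for each database that has it). This is not the real maintain-views.py and does not reflect the actual schema of maintain-views.yaml; the function name `view_statements`, the `fullviews` list, and the `<db>_p` naming of the public view database are all assumptions made for illustration.

```python
# Hypothetical illustration only: emit one CREATE VIEW statement per
# (database, table) pair, mirroring "run it for each database the table exists in".

def view_statements(databases, tables):
    """Yield a CREATE VIEW statement for every table in every database."""
    for db in databases:
        for table in tables:
            # Assumption: public views live in a sibling database named <db>_p.
            yield (
                f"CREATE OR REPLACE VIEW {db}_p.{table} "
                f"AS SELECT * FROM {db}.{table};"
            )


# The two tables named in the request; both are said to hold no private data,
# so the whole table is exposed.
fullviews = ["page_assessments", "page_assessments_projects"]

for statement in view_statements(["enwiki"], fullviews):
    print(statement)
```

Tables with private columns would need a narrower SELECT instead of `SELECT *`; that case is deliberately left out of this sketch.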
[08:05:15] yeah that
[08:23:41] (PS1) Kaldari: Adding views for two PageAssessments tables for Labs [puppet] - https://gerrit.wikimedia.org/r/321845
[08:28:00] (PS1) Muehlenhoff: Add debdeploy salt grain for labs::db::proxy [puppet] - https://gerrit.wikimedia.org/r/321846
[08:30:45] (CR) Marostegui: "I have added Jaime as a reviewer - so I can also learn the process here." [puppet] - https://gerrit.wikimedia.org/r/321845 (owner: Kaldari)
[08:31:38] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[08:39:35] Operations, Commons, MediaWiki-File-management, Multimedia, Upstream: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#2798176 (MoritzMuehlenhoff) I'll look int...
[08:48:51] !log installing libgd security updates on remaining app servers
[08:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:28] PROBLEM - puppet last run on mw1178 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:57:27] (PS2) Muehlenhoff: Provide a systemd override unit for memcached [puppet] - https://gerrit.wikimedia.org/r/319820
[08:58:36] (CR) Giuseppe Lavagetto: [C: 1] "I mainly verified that the HHVM collection is correct, and it is." [software/hhvm_exporter] - https://gerrit.wikimedia.org/r/319477 (https://phabricator.wikimedia.org/T147423) (owner: Filippo Giunchedi)
[09:05:29] Operations, Beta-Cluster-Infrastructure, Thumbor: Thumbor keeps losing Swift auth on beta - https://phabricator.wikimedia.org/T150649#2792043 (hashar) p:Triage>Normal
[09:18:45] Operations: Monitor failing ferm restarts / availability of ferm service - https://phabricator.wikimedia.org/T108303#2798218 (MoritzMuehlenhoff) Open>Resolved We now have an Icinga check for ferm.
[09:22:28] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[09:24:57] (PS2) Giuseppe Lavagetto: Parsoid: Use Scap3 for config-file deploys [puppet] - https://gerrit.wikimedia.org/r/315069 (https://phabricator.wikimedia.org/T144596) (owner: Mobrovac)
[09:26:57] (CR) Giuseppe Lavagetto: [C: 2] Parsoid: Use Scap3 for config-file deploys [puppet] - https://gerrit.wikimedia.org/r/315069 (https://phabricator.wikimedia.org/T144596) (owner: Mobrovac)
[09:35:28] PROBLEM - puppet last run on maps1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:44:11] !log parsoid deployed e41b235
[09:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:41] Operations, Parsoid, Services (done), User-mobrovac: Deploy failed on wtp2017.codfw.wmnet - https://phabricator.wikimedia.org/T149115#2798292 (mobrovac) Open>Resolved The new scripts will now officially be used, resolving.
[09:57:24] (PS2) Jcrespo: Repool db2042 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/321476 (https://phabricator.wikimedia.org/T150334)
[10:03:04] (CR) Jcrespo: [C: 2] Repool db2042 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/321476 (https://phabricator.wikimedia.org/T150334) (owner: Jcrespo)
[10:04:12] (Merged) jenkins-bot: Repool db2042 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/321476 (https://phabricator.wikimedia.org/T150334) (owner: Jcrespo)
[10:04:28] RECOVERY - puppet last run on maps1003 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[10:07:02] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2042 (duration: 00m 49s)
[10:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:37] !log tstarling@tin Started scap: (no message)
[10:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:02] Operations, LDAP, Patch-For-Review: Enhance group membership visibility using the memberof LDAP overlay - https://phabricator.wikimedia.org/T142817#2798376 (MoritzMuehlenhoff) I checked how changes to the nda group have trickled in and for more than half of the members the change only affected the sl...
[10:16:34] !log applying schema change on s2 (page) T69223
[10:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:55] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223
[10:22:17] Operations, ops-codfw, DBA: db2034 crashes meta ticket - https://phabricator.wikimedia.org/T150233#2798398 (Marostegui) p:Triage>High
[10:35:12] !log tstarling@tin Finished scap: (no message) (duration: 22m 34s)
[10:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:26] (CR) Alexandros Kosiaris: "OK, these way more justified technical reasons." [puppet] - https://gerrit.wikimedia.org/r/320690 (https://phabricator.wikimedia.org/T143925) (owner: Ottomata)
[10:40:56] !log tstarling@tin Synchronized wmf-config/llama.php: (no message) (duration: 00m 48s)
[10:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:48] (CR) Marostegui: "Do you mean creating the init.pp inside of: mariadb/backups (which would be a new directory)/?" [puppet] - https://gerrit.wikimedia.org/r/320989 (owner: Marostegui)
[10:49:59] !log tstarling@tin Synchronized wmf-config/llama.php: (no message) (duration: 00m 48s)
[10:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:57] !log rebooting labsdb1007 (OSM slave) for kernel update
[10:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:21] (CR) Jcrespo: "I do not really care, whatever works and you can decide what looks better for you." [puppet] - https://gerrit.wikimedia.org/r/320989 (owner: Marostegui)
[11:01:31] !log rolling cache_text upgrade to varnish 4.1.3-1wm4 and reboot with linux 4.4.2-3+wmf7
[11:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:37] (CR) Jcrespo: "https://wikitech.wikimedia.org/wiki/MariaDB#Account_handling" [puppet] - https://gerrit.wikimedia.org/r/320822 (https://phabricator.wikimedia.org/T150446) (owner: Jcrespo)
[11:03:49] !log rebooting labsdb1006 (OSM master) for kernel update
[11:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:58] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1065 is CRITICAL: connect to address 10.64.0.102 and port 3122: Connection refused
[11:10:58] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1065 is CRITICAL: connect to address 10.64.0.102 and port 3125: Connection refused
[11:11:04] thats'
[11:11:08] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1065 is CRITICAL: connect to address 10.64.0.102 and port 3124: Connection refused
[11:11:12] me :) ^
[11:11:18] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1065 is CRITICAL: connect to address 10.64.0.102 and port 3120: Connection refused
[11:11:38] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file]
[11:11:58] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1065 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.000 second response time
[11:11:58] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1065 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.001 second response time
[11:12:08] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1065 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.001 second response time
[11:12:18] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1065 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.000 second response time
[11:12:38] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[11:17:18] PROBLEM - NTP on cp1065 is CRITICAL: NTP CRITICAL: Offset unknown
[11:17:38] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:26:07] (PS3) Jcrespo: mariadb: Enable unix socket authentication everywhere [puppet] - https://gerrit.wikimedia.org/r/320822 (https://phabricator.wikimedia.org/T150446)
[11:26:23] (CR) jenkins-bot: [V: -1] mariadb: Enable unix socket authentication everywhere [puppet] - https://gerrit.wikimedia.org/r/320822 (https://phabricator.wikimedia.org/T150446) (owner: Jcrespo)
[11:26:59] (PS4) Jcrespo: mariadb: Enable unix socket authentication everywhere [puppet] - https://gerrit.wikimedia.org/r/320822 (https://phabricator.wikimedia.org/T150446)
[11:27:18] RECOVERY - NTP on cp1065 is OK: NTP OK: Offset 0.0005805492401 secs
[11:27:49] (CR) Marostegui: "Thanks for the documentation!" [puppet] - https://gerrit.wikimedia.org/r/320822 (https://phabricator.wikimedia.org/T150446) (owner: Jcrespo)
[11:28:38] PROBLEM - DPKG on heze is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:29:17] (PS1) Alexandros Kosiaris: heze: reinstall as jessie [puppet] - https://gerrit.wikimedia.org/r/321859
[11:30:17] (CR) Jcrespo: [C: 2] mariadb: Enable unix socket authentication everywhere [puppet] - https://gerrit.wikimedia.org/r/320822 (https://phabricator.wikimedia.org/T150446) (owner: Jcrespo)
[11:31:58] PROBLEM - NTP on cp4009 is CRITICAL: NTP CRITICAL: Offset unknown
[11:32:28] PROBLEM - NTP on cp1054 is CRITICAL: NTP CRITICAL: Offset unknown
[11:32:59] PROBLEM - DPKG on oresrdb1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:35:50] (CR) Alexandros Kosiaris: [C: 2] heze: reinstall as jessie [puppet] - https://gerrit.wikimedia.org/r/321859 (owner: Alexandros Kosiaris)
[11:35:56] (PS2) Alexandros Kosiaris: heze: reinstall as jessie [puppet] - https://gerrit.wikimedia.org/r/321859
[11:35:58] RECOVERY - DPKG on oresrdb1002 is OK: All packages OK
[11:36:00] (CR) Alexandros Kosiaris: [V: 2] heze: reinstall as jessie [puppet] - https://gerrit.wikimedia.org/r/321859 (owner: Alexandros Kosiaris)
[11:41:58] RECOVERY - NTP on cp4009 is OK: NTP OK: Offset -0.0001101493835 secs
[11:42:14] (PS1) Mobrovac: CXServer: Use Scap3 to deploy the config [puppet] - https://gerrit.wikimedia.org/r/321860 (https://phabricator.wikimedia.org/T147634)
[11:42:28] RECOVERY - NTP on cp1054 is OK: NTP OK: Offset 0.0002536475658 secs
[11:45:38] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[11:45:42] (PS2) Mobrovac: CXServer: Use Scap3 to deploy the config [puppet] - https://gerrit.wikimedia.org/r/321860 (https://phabricator.wikimedia.org/T147634)
[11:47:48] PROBLEM - NTP on oresrdb1002 is CRITICAL: NTP CRITICAL: Offset unknown
[11:48:23] (CR) Mobrovac: "PCC looking good - https://puppet-compiler.wmflabs.org/4594/" [puppet] - https://gerrit.wikimedia.org/r/321860 (https://phabricator.wikimedia.org/T147634) (owner: Mobrovac)
[11:57:48] RECOVERY - NTP on oresrdb1002 is OK: NTP OK: Offset 0.0002198219299 secs
[12:00:22] Hey Ops, I need to remove one of my contacts from the private repo
[12:00:28] anyone around to do it?
[12:02:24] Amir1: I can try to help and if not escalate to someone that knows :)
[12:03:39] marostegui: thanks, you need to login to puppetmaster1001 and make a patch there
[12:03:47] yeah, I am already there
[12:03:54] What do you need then?
[12:03:59] https://wikitech.wikimedia.org/wiki/Puppet#Private_puppet
[12:04:35] (CR) Muehlenhoff: [C: 2] Add debdeploy salt grain for labs::db::proxy [puppet] - https://gerrit.wikimedia.org/r/321846 (owner: Muehlenhoff)
[12:04:40] (PS2) Muehlenhoff: Add debdeploy salt grain for labs::db::proxy [puppet] - https://gerrit.wikimedia.org/r/321846
[12:05:58] !log installing trusty kernel updates
[12:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:54] what's the current status of superprotect ?
[12:14:20] it's gone? [12:15:07] could really do with someone superprotecting the en.wp Main Page right about now. [12:15:18] PROBLEM - puppet last run on db1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:15:42] another compromised admin account and the Main Page messed about with. [12:16:24] apergos: ^^^ [12:16:35] yeah aware [12:22:45] Maybe a notice should go on Wikipedia for all admins to change their passwords? [12:22:58] PROBLEM - NTP on cp2001 is CRITICAL: NTP CRITICAL: Offset unknown [12:23:05] Or make it mandatory to change your passwords for Wikipedia? [12:23:38] Otherwise they will just keep trying to hack into users' accounts, making the user look bad when they did not do it, it was the hackers. [12:24:29] !log deploying unix_socket authentication to all core databases T150446 [12:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:57] RECOVERY - NTP on cp2001 is OK: NTP OK: Offset 6.479024887e-05 secs [12:34:00] (03PS1) 10Hashar: Revert "Add German Wiktionary in beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321866 (https://phabricator.wikimedia.org/T150764) [12:34:37] RECOVERY - DPKG on heze is OK: All packages OK [12:34:56] (03CR) 10Hashar: [C: 032] Revert "Add German Wiktionary in beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321866 (https://phabricator.wikimedia.org/T150764) (owner: 10Hashar) [12:35:33] (03Merged) 10jenkins-bot: Revert "Add German Wiktionary in beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321866 (https://phabricator.wikimedia.org/T150764) (owner: 10Hashar) [12:37:09] 06Operations, 10Wikimedia-Apache-configuration: catch-all apache vhost on the cluster should return 404 for non-existing sites - https://phabricator.wikimedia.org/T137176#2798770 (10elukey) ``` elukey@mw1099:~$ sudo apachectl -S VirtualHost configuration: 127.0.0.1:80 localhost (/etc/apache2/conf-ena...
[12:44:17] RECOVERY - puppet last run on db1066 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [12:56:17] PROBLEM - DPKG on analytics1029 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:57:18] RECOVERY - DPKG on analytics1029 is OK: All packages OK [13:27:21] 06Operations, 06Labs, 10MediaWiki-extensions-TwoFactorAuthentication, 10wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2798872 (10Shizhao) [13:35:18] PROBLEM - salt-minion processes on heze is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:39:08] jouncebot: next [13:39:09] In 0 hour(s) and 20 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161116T1400) [13:39:18] RECOVERY - salt-minion processes on heze is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:41:53] (03PS5) 10Hashar: Remove patrol from autoconfirmed and reviewer for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318515 (https://phabricator.wikimedia.org/T149019) (owner: 10Cenarium) [13:42:46] (03CR) 10Hashar: [C: 031] "All good. Match T149019 definition, eg having "patrol" right solely for the "patroller" group." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/318515 (https://phabricator.wikimedia.org/T149019) (owner: 10Cenarium) [13:48:08] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.424 second response time [13:48:08] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.372 second response time [13:48:28] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 531 bytes in 0.005 second response time [13:49:08] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.365 second response time [13:49:08] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.303 second response time [13:49:28] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.052 second response time [13:54:07] ACKNOWLEDGEMENT - IPsec on cp1045 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp3009_v4, cp3009_v6 Ema cp3009 is down: T148422 [13:54:07] ACKNOWLEDGEMENT - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 22 connecting: cp3009_v4, cp3009_v6 Ema cp3009 is down: T148422 [13:54:07] ACKNOWLEDGEMENT - IPsec on cp1058 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp3009_v4, cp3009_v6 Ema cp3009 is down: T148422 [13:54:07] ACKNOWLEDGEMENT - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 22 connecting: cp3009_v4, cp3009_v6 Ema cp3009 is down: T148422 [13:54:07] ACKNOWLEDGEMENT - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3009_v4, 
cp3009_v6 Ema cp3009 is down: T148422 [13:54:07] ACKNOWLEDGEMENT - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp3009_v4, cp3009_v6 Ema cp3009 is down: T148422 [13:54:08] ACKNOWLEDGEMENT - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3009_v4, cp3009_v6 Ema cp3009 is down: T148422 [13:54:08] ACKNOWLEDGEMENT - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp3009_v4, cp3009_v6 Ema cp3009 is down: T148422 [13:54:09] ACKNOWLEDGEMENT - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp3009_v4, cp3009_v6 Ema cp3009 is down: T148422 [13:54:09] ACKNOWLEDGEMENT - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp3009_v4, cp3009_v6 Ema cp3009 is down: T148422 [13:54:10] ACKNOWLEDGEMENT - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp3009_v4, cp3009_v6 Ema cp3009 is down: T148422 [13:54:11] ACKNOWLEDGEMENT - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp3009_v4, cp3009_v6 Ema cp3009 is down: T148422 [13:54:11] ACKNOWLEDGEMENT - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp3009_v4, cp3009_v6 Ema cp3009 is down: T148422 [13:54:11] ACKNOWLEDGEMENT - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp3009_v4, cp3009_v6 Ema cp3009 is down: T148422 [13:54:28] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 531 bytes in 0.006 second response time [13:54:57] (03PS1) 10Rush: labstore: add tools-home and tools-project to nfs-exportd [puppet] - 10https://gerrit.wikimedia.org/r/321875 [13:55:08] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.420 second response time [13:55:08] PROBLEM - Start a job and 
verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.451 second response time [13:55:58] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 272 bytes in 0.004 second response time [13:56:02] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 306 bytes in 0.007 second response time [13:56:02] ACKNOWLEDGEMENT - Host cp3009 is DOWN: PING CRITICAL - Packet loss = 100% Ema cp3009 is down: T148422 [13:56:02] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 268 bytes in 0.007 second response time [13:56:25] (03PS2) 10Marostegui: mariadb: Split backup and otrsbackups classes into a different file [puppet] - 10https://gerrit.wikimedia.org/r/320989 (https://phabricator.wikimedia.org/T150851) [13:56:28] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.078 second response time [13:56:58] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.008 second response time [13:57:02] RECOVERY - NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.013 second response time [13:57:02] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.217 second response time [13:57:08] RECOVERY - Start a job and verify on 
Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.791 second response time [13:57:18] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 4.207 second response time [13:57:38] Dereckson: are you doing EU SWAT today? I see you have merged the commits [13:58:50] (03PS2) 10Rush: labstore: add tools-home and tools-project to nfs-exportd [puppet] - 10https://gerrit.wikimedia.org/r/321875 [13:59:01] (03PS3) 10Thiemo Mättig (WMDE): Add missing $wgPropertySuggesterClassifyingPropertyIds for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320192 [13:59:06] (03CR) 10Rush: [C: 032 V: 032] labstore: add tools-home and tools-project to nfs-exportd [puppet] - 10https://gerrit.wikimedia.org/r/321875 (owner: 10Rush) [13:59:40] (03CR) 10Thiemo Mättig (WMDE): "Hello? Anybody aware of this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320192 (owner: 10Thiemo Mättig (WMDE)) [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161116T1400). Please do the needful. [14:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:17] Present. [14:00:43] I can SWAT today, if Dereckson did not plan to do it [14:01:00] since he has already merged the commits [14:01:17] will wait a few more minutes and start the SWAT, if he does not reply [14:02:17] Sure, it's okay. [14:03:06] well, he did not reply in 5 minutes, I am assuming he is not around, starting EU SWAT [14:03:37] ebernhardson: around for SWAT? 
[14:03:47] (03CR) 10Marostegui: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/4595/" [puppet] - 10https://gerrit.wikimedia.org/r/320989 (https://phabricator.wikimedia.org/T150851) (owner: 10Marostegui) [14:05:12] Dereckson: oh, my mistake, looking at the wrong SWAT window, please ignore me :) [14:05:37] Urbanecm_: sorry, I was looking at the wrong SWAT window, you are the only one for this SWAT :) [14:06:06] Okay, so I'm ready :) [14:06:24] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318515 (https://phabricator.wikimedia.org/T149019) (owner: 10Cenarium) [14:07:04] (03Merged) 10jenkins-bot: Remove patrol from autoconfirmed and reviewer for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318515 (https://phabricator.wikimedia.org/T149019) (owner: 10Cenarium) [14:07:06] Urbanecm_: merging 318515, can you test it at mw1099, once it is there? [14:07:30] Yes. [14:07:34] Could I start testing? [14:07:37] if anyone decides they need to scap wmf-config, they (may) need to poke me, not sure... [14:08:32] apergos: I am doing SWAT for this https://gerrit.wikimedia.org/r/#/c/318515/ [14:08:40] Urbanecm_: just a minute [14:08:59] Sure. [14:09:14] zeljkof: lemme ask someone with more of a clue than me [14:09:17] give me 1 minute [14:09:32] apergos: sure, waiting [14:09:47] Urbanecm_: waiting until I see if I can continue with SWAT [14:10:18] zeljkof: Ok. [14:10:32] everything is ready, I am just not sure if I can continue [14:10:41] zeljkof: Ok [14:11:41] sorry. 
you will be able to, it's just whether I have to do any cleanup first [14:12:04] the expert says it's fine so [14:12:15] (03PS1) 10Muehlenhoff: package_builder: Add pkg-kde-tools to list of installed packages [puppet] - 10https://gerrit.wikimedia.org/r/321876 [14:12:30] (03PS2) 10Muehlenhoff: package_builder: Add pkg-kde-tools to list of installed packages [puppet] - 10https://gerrit.wikimedia.org/r/321876 [14:13:00] apergos: ok, continuing with EU SWAT cc Urbanecm_ [14:13:10] thanks [14:13:13] sorry for false alarm [14:13:24] all clear [14:13:24] apergos: no problem, better safe than sorry :) [14:13:25] Ok, waiting for pulling at mw1099 [14:13:26] clarified [14:13:33] just git rebase [14:13:34] Hi hashar :) [14:13:38] and scap pull on mw1099 and you are all set [14:13:41] * hashar waves [14:14:58] Urbanecm_: sorry for the delay, 318515 is at mw1099, please test [14:15:06] Testing [14:15:46] (03CR) 10Muehlenhoff: [C: 032] package_builder: Add pkg-kde-tools to list of installed packages [puppet] - 10https://gerrit.wikimedia.org/r/321876 (owner: 10Muehlenhoff) [14:16:33] zeljkof: 318515 works, please deploy to the whole universe. [14:17:06] Urbanecm_: deploying to the universe, known and unknown [14:17:28] PROBLEM - NTP on cp4017 is CRITICAL: NTP CRITICAL: Offset unknown [14:18:19] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:318515|Remove patrol from autoconfirmed and reviewer for enwiki (T149019)]] (duration: 00m 49s) [14:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:38] T149019: Add the patroller group to the English Wikipedia - https://phabricator.wikimedia.org/T149019 [14:18:46] Urbanecm_: deployed, 318515 is where no patch has gone before [14:18:49] please test :) [14:19:34] Thanks a lot zeljkof [14:19:35] All works! [14:19:53] Great!
That is all I guess then [14:20:27] !log EU SWAT finished [14:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:46] (03PS1) 10Jcrespo: beta-mysql: Enable unix_socket authentication [puppet] - 10https://gerrit.wikimedia.org/r/321878 (https://phabricator.wikimedia.org/T150446) [14:23:28] PROBLEM - Check whether ferm is active by checking the default input chain on db1066 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:23:58] (03CR) 10Marostegui: [C: 031] beta-mysql: Enable unix_socket authentication [puppet] - 10https://gerrit.wikimedia.org/r/321878 (https://phabricator.wikimedia.org/T150446) (owner: 10Jcrespo) [14:24:12] db1066 seems to be lagging quite a bit, see https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb= (now 167). I noticed my bot started waiting..... [14:24:18] RECOVERY - Check whether ferm is active by checking the default input chain on db1066 is OK: OK ferm input default policy is set [14:24:27] (03PS1) 10Muehlenhoff: package_builder: Add subversion to list of installed packages [puppet] - 10https://gerrit.wikimedia.org/r/321880 [14:24:45] (03PS2) 10Jcrespo: beta-mysql: Enable unix_socket authentication [puppet] - 10https://gerrit.wikimedia.org/r/321878 (https://phabricator.wikimedia.org/T150446) [14:25:07] multichill: it is good now: Seconds_Behind_Master: 0 could be temporary? 
[14:25:48] (03CR) 10Muehlenhoff: [C: 032] package_builder: Add subversion to list of installed packages [puppet] - 10https://gerrit.wikimedia.org/r/321880 (owner: 10Muehlenhoff) [14:26:23] (03Draft2) 10Urbanecm: Autopatrolled group for et.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321879 (https://phabricator.wikimedia.org/T150852) [14:27:05] (03Abandoned) 10Urbanecm: Autopatrolled group for et.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321879 (https://phabricator.wikimedia.org/T150852) (owner: 10Urbanecm) [14:27:08] (03Restored) 10Urbanecm: Autopatrolled group for et.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321879 (https://phabricator.wikimedia.org/T150852) (owner: 10Urbanecm) [14:27:46] marostegui: Let's hope it just a glitch, but seems to be going up again [14:29:08] zeljkof: I noticed T150852 just now. Can 321879 be deployed in EU SWAT (formally it did not end as it ends at 15:00 UTC) or should I schedule it to tomorrow window? [14:29:08] T150852: Autopatrolled group for et.wikipedia.org - https://phabricator.wikimedia.org/T150852 [14:29:48] Urbanecm_: I am still around, please add it to the calendar and I will deploy it :) [14:30:56] zeljkof: Added [14:31:26] !log Starting EU SWAT, part two! 
[14:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:48] (03CR) 10Zfilipin: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321879 (https://phabricator.wikimedia.org/T150852) (owner: 10Urbanecm) [14:34:42] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321879 (https://phabricator.wikimedia.org/T150852) (owner: 10Urbanecm) [14:34:56] (03PS1) 10Rush: tools: when establishing /home from NFS force creation [puppet] - 10https://gerrit.wikimedia.org/r/321883 (https://phabricator.wikimedia.org/T150829) [14:35:08] PROBLEM - NTP on cp1066 is CRITICAL: NTP CRITICAL: Offset unknown [14:35:19] (03Merged) 10jenkins-bot: Autopatrolled group for et.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321879 (https://phabricator.wikimedia.org/T150852) (owner: 10Urbanecm) [14:35:53] multichill: there is definitely more activity in that server than usual indeed [14:36:14] Urbanecm_: about to push 321879 to mw1099, can you test it there? [14:36:16] ffs why gerrit takes so much time to scroll big files to be able to add line comments [14:36:39] (03PS2) 10Rush: tools: when establishing links to NFS force creation [puppet] - 10https://gerrit.wikimedia.org/r/321883 (https://phabricator.wikimedia.org/T150829) [14:36:46] zeljkof: I should be able to do it. [14:36:58] wgaddgroups wgremovegroups should let sysop add/remove autopatrolled [14:37:04] (03PS3) 10Rush: tools: when establishing links to NFS force creation [puppet] - 10https://gerrit.wikimedia.org/r/321883 (https://phabricator.wikimedia.org/T150829) [14:37:12] and thats not in the change merged [14:37:20] marostegui: I'm using pywikibot and it respects the replag. So it when db server is sick, I'll notice that. Seems to be coming in waves. 
Now it's good [14:37:23] (03PS3) 10Jcrespo: beta-mysql: Enable unix_socket authentication [puppet] - 10https://gerrit.wikimedia.org/r/321878 (https://phabricator.wikimedia.org/T150446) [14:37:28] RECOVERY - NTP on cp4017 is OK: NTP OK: Offset -0.0004454851151 secs [14:37:42] Urbanecm_: it is there, please test [14:37:46] multichill: Yep, now it is fine, but it has been having spikes in disk activity (and repl lag) [14:37:48] PROBLEM - NTP on cp2019 is CRITICAL: NTP CRITICAL: Offset unknown [14:37:49] Going to do it zeljkof [14:37:51] ^ [14:37:58] Urbanecm [14:38:28] arseny92: What was for me? [14:38:31] zeljkof: All works. [14:38:32] yes [14:38:40] (03CR) 10Andrew Bogott: [C: 032] tools: when establishing links to NFS force creation [puppet] - 10https://gerrit.wikimedia.org/r/321883 (https://phabricator.wikimedia.org/T150829) (owner: 10Rush) [14:38:42] Urbanecm_: ok, deploying [14:38:49] arseny92: WHAT was for me. Not was it for me :) [14:40:15] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:321879|Autopatrolled group for et.wikipedia.org (T150852)]] (duration: 00m 51s) [14:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:34] T150852: Autopatrolled group for et.wikipedia.org - https://phabricator.wikimedia.org/T150852 [14:40:36] Urbanecm_: deployed to the interplanetary network, please check [14:41:10] [16:36] ffs why gerrit takes so much time to scroll big files to be able to add line comments [14:41:11] [16:36] wgaddgroups wgremovegroups should let sysop add/remove autopatrolled [14:41:17] [16:37] and thats not in the change merged [14:41:23] arseny92: Oops... [14:41:33] zeljkof: Don't end the SWAT, preparing a fix-up change :D [14:41:43] Urbanecm_: :D [14:41:45] Faster submitting than thinking... [14:41:55] ok, I'm around [14:42:18] PROBLEM - NTP on cp3040 is CRITICAL: NTP CRITICAL: Offset unknown [14:44:29] (03CR) 10Hashar: "Others?
Not sure what you mean :} That is solely used on contint1001 / Jessie." [puppet] - 10https://gerrit.wikimedia.org/r/321650 (https://phabricator.wikimedia.org/T150727) (owner: 10Hashar) [14:45:06] (03CR) 10Ottomata: "> That one means that one broker down == The entire service (along with whatever other service uses topics that reside on that broker) dow" [puppet] - 10https://gerrit.wikimedia.org/r/320690 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [14:45:08] RECOVERY - NTP on cp1066 is OK: NTP OK: Offset -0.000256061554 secs [14:46:05] (03Draft2) 10Urbanecm: Fix autopatrolled for etwiki - 321879 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321884 (https://phabricator.wikimedia.org/T150852) [14:47:15] (03PS3) 10Urbanecm: Fix autopatrolled for etwiki - 321879 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321884 (https://phabricator.wikimedia.org/T150852) [14:47:17] (03PS4) 10Jcrespo: beta-mysql: Enable unix_socket authentication [puppet] - 10https://gerrit.wikimedia.org/r/321878 (https://phabricator.wikimedia.org/T150446) [14:47:39] zeljkof: Deploy 321884 please... [14:47:41] (03PS1) 10Alexandros Kosiaris: icinga: Increase NTP check intervals [puppet] - 10https://gerrit.wikimedia.org/r/321885 [14:47:48] RECOVERY - NTP on cp2019 is OK: NTP OK: Offset -0.0001450181007 secs [14:47:53] (03CR) 10Arseny1992: Fix autopatrolled for etwiki - 321879 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321884 (https://phabricator.wikimedia.org/T150852) (owner: 10Urbanecm) [14:47:53] zeljkof: Adding to calendar is in progress. [14:48:09] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] icinga: Increase NTP check intervals [puppet] - 10https://gerrit.wikimedia.org/r/321885 (owner: 10Alexandros Kosiaris) [14:48:16] ^ [14:48:32] arseny92: Do you still see whitespace? I think I removed it...
[14:48:45] (03PS5) 10Jcrespo: beta-mysql: Enable unix_socket authentication [puppet] - 10https://gerrit.wikimedia.org/r/321878 (https://phabricator.wikimedia.org/T150446) [14:48:56] Urbanecm_: on it [14:49:06] I posted on same second you updated [14:49:12] (03CR) 10Ottomata: "Oh, and as for expense of consumer connection, I don't think it should be much. But, this is the first time that a public internet http r" [puppet] - 10https://gerrit.wikimedia.org/r/320690 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [14:49:13] arseny92: So all is okay? [14:49:52] (03CR) 10Arseny1992: [C: 031] Fix autopatrolled for etwiki - 321879 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321884 (https://phabricator.wikimedia.org/T150852) (owner: 10Urbanecm) [14:50:17] now yes ^ [14:50:47] Okay, zeljkof, please start the deploying process. [14:51:02] (03CR) 10Jcrespo: [C: 032] beta-mysql: Enable unix_socket authentication [puppet] - 10https://gerrit.wikimedia.org/r/321878 (https://phabricator.wikimedia.org/T150446) (owner: 10Jcrespo) [14:51:09] Urbanecm_: ok, please add the patch to the wiki [14:51:35] zeljkof: Added. [14:51:43] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321884 (https://phabricator.wikimedia.org/T150852) (owner: 10Urbanecm) [14:52:18] RECOVERY - NTP on cp3040 is OK: NTP OK: Offset 0.0008793771267 secs [14:52:18] (03Merged) 10jenkins-bot: Fix autopatrolled for etwiki - 321879 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321884 (https://phabricator.wikimedia.org/T150852) (owner: 10Urbanecm) [14:52:33] (03PS8) 10Faidon Liambotis: Split check_ssl between traditional year-long certs and LE's 3 month certs [puppet] - 10https://gerrit.wikimedia.org/r/313805 (https://phabricator.wikimedia.org/T144293) (owner: 10Alex Monk) [14:52:40] (03CR) 10Faidon Liambotis: [C: 032] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/313805 (https://phabricator.wikimedia.org/T144293) (owner: 10Alex Monk) [14:52:42] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, 07Upstream: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#2799251 (10MoritzMuehlenhoff) A build with... [14:53:23] 06Operations, 06Security-Team, 13Patch-For-Review: Create cronjob for regular captcha regeneration - https://phabricator.wikimedia.org/T150029#2799266 (10Reedy) @aaron Do we need to delete the old ones etc too? [14:54:08] PROBLEM - NTP on cp4008 is CRITICAL: NTP CRITICAL: Offset unknown [14:55:39] Urbanecm_: 321884 is at mw1099, can you test it there? [14:55:54] zeljkof: Sure. Testing is in progress. [14:56:16] 06Operations, 10ops-eqiad, 10hardware-requests: Return wmf4747/wmf4748/wmf4749/wmf4750 to spares - https://phabricator.wikimedia.org/T146171#2799275 (10faidon) Ping? [14:56:29] zeljkof: It works. [14:56:36] Urbanecm_: deploying... [14:57:37] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, 07Upstream: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#2799276 (10MoritzMuehlenhoff) That fixes th... [14:58:00] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:321884|Autopatrolled group for et.wikipedia.org (T150852)]] (duration: 00m 55s) [14:58:02] Eating the whole SWAT window is really very easy... [14:58:13] Urbanecm_: :D deployed, please test [14:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:21] T150852: Autopatrolled group for et.wikipedia.org - https://phabricator.wikimedia.org/T150852 [14:58:25] zeljkof: Testing.
[14:58:42] https://et.wikipedia.org/wiki/Eri:Kasutajar%C3%BChma_%C3%B5igused?uselang=en [14:58:51] seem to be working [14:59:00] zeljkof: arseny92 Yes it works. [14:59:05] Thanks for your work zeljkof ! [14:59:13] Urbanecm_: great! [14:59:18] PROBLEM - NTP on cp2016 is CRITICAL: NTP CRITICAL: Offset unknown [14:59:23] they'd need to create wikipages for the groups tho [14:59:24] !log EU SWAT finished! For real, this time. [14:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:39] arseny92: I think this is their problem. It is translated and I think anything else isn't needed to be enforced from our side. [15:02:39] From our side not. Just note on the task they'd need to create local pages for redlink groups linked from ListGroupRights [15:04:08] RECOVERY - NTP on cp4008 is OK: NTP OK: Offset -0.0001659691334 secs [15:04:40] !log rolling cache_upload upgrade to varnish 4.1.3-1wm4 and reboot with linux 4.4.2-3+wmf7 [15:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:06] Urbanecm ^ [15:07:43] PROBLEM - NTP on cp3033 is CRITICAL: NTP CRITICAL: Offset unknown [15:09:13] RECOVERY - NTP on cp2016 is OK: NTP OK: Offset -0.001117646694 secs [15:09:59] (03PS2) 10Ori.livneh: Re-enable AbuseFilterCachingParser everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321728 [15:10:54] (03CR) 10Ori.livneh: [C: 032] Re-enable AbuseFilterCachingParser everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321728 (owner: 10Ori.livneh) [15:11:39] (03Merged) 10jenkins-bot: Re-enable AbuseFilterCachingParser everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321728 (owner: 10Ori.livneh) [15:11:46] arseny92: Okay, I'll note it... [15:13:48] arseny92: Noted. 
[15:17:09] (03PS1) 10Rush: nfs-exportd: ensure running and start on boot [puppet] - 10https://gerrit.wikimedia.org/r/321886 (https://phabricator.wikimedia.org/T150829) [15:17:19] (03PS2) 10Rush: nfs-exportd: ensure running and start on boot [puppet] - 10https://gerrit.wikimedia.org/r/321886 (https://phabricator.wikimedia.org/T150829) [15:18:39] (03CR) 10jenkins-bot: [V: 04-1] nfs-exportd: ensure running and start on boot [puppet] - 10https://gerrit.wikimedia.org/r/321886 (https://phabricator.wikimedia.org/T150829) (owner: 10Rush) [15:18:55] 06Operations, 10OCG-General, 06Wiktionary, 13Patch-For-Review: Download as PDF does not work in English Wiktionary: "There was an error while attempting to render your book." - https://phabricator.wikimedia.org/T150604#2799315 (10Marostegui) p:05Unbreak!>03High [15:20:00] (03PS3) 10Rush: nfs-exportd: ensure running and start on boot [puppet] - 10https://gerrit.wikimedia.org/r/321886 (https://phabricator.wikimedia.org/T150829) [15:20:38] !log ori@tin Synchronized wmf-config/InitialiseSettings.php: I968050af3f: Re-enable AbuseFilterCachingParser everywhere (duration: 00m 50s) [15:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:45] (03PS4) 10Rush: nfs-exportd: ensure running and start on boot [puppet] - 10https://gerrit.wikimedia.org/r/321886 (https://phabricator.wikimedia.org/T150829) [15:22:54] (03PS1) 10Jcrespo: mariadb: Remove /root/.my.cnf from all servers [puppet] - 10https://gerrit.wikimedia.org/r/321888 (https://phabricator.wikimedia.org/T150446) [15:24:11] (03CR) 10Rush: [C: 032] nfs-exportd: ensure running and start on boot [puppet] - 10https://gerrit.wikimedia.org/r/321886 (https://phabricator.wikimedia.org/T150829) (owner: 10Rush) [15:24:26] (03CR) 10jenkins-bot: [V: 04-1] mariadb: Remove /root/.my.cnf from all servers [puppet] - 10https://gerrit.wikimedia.org/r/321888 (https://phabricator.wikimedia.org/T150446) (owner: 10Jcrespo) [15:27:22] PROBLEM - NTP on cp2007 is CRITICAL: 
NTP CRITICAL: Offset unknown [15:28:08] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154#2799324 (10chasemp) [15:30:10] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2799327 (10chasemp) [15:30:13] 06Operations, 06Labs, 07Tracking: Sync data for tools-project from labstore1001 to labstore1004/5 - https://phabricator.wikimedia.org/T144255#2799325 (10chasemp) 05Open>03Resolved This was done on sunday for a sync within 24 hours of main maint for Tools. The actual outage period sync took around 5h for... [15:30:28] (03PS2) 10Jcrespo: mariadb: Remove /root/.my.cnf from all servers [puppet] - 10https://gerrit.wikimedia.org/r/321888 (https://phabricator.wikimedia.org/T150446) [15:31:40] (03CR) 10jenkins-bot: [V: 04-1] mariadb: Remove /root/.my.cnf from all servers [puppet] - 10https://gerrit.wikimedia.org/r/321888 (https://phabricator.wikimedia.org/T150446) (owner: 10Jcrespo) [15:31:47] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2004220 (10chasemp) [15:31:50] 06Operations, 06Labs, 13Patch-For-Review: revise/fix labstore replicate backup jobs - https://phabricator.wikimedia.org/T127567#2799330 (10chasemp) 05Open>03Resolved A bit of monitoring improvements ongoing in {T144633} but generally this is done. 
[15:32:28] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154#2799335 (10chasemp) [15:37:42] RECOVERY - NTP on cp3033 is OK: NTP OK: Offset -0.0002483129501 secs [15:41:54] (03PS1) 10Marostegui: db-codfw.php: Repool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321889 (https://phabricator.wikimedia.org/T150518) [15:43:22] PROBLEM - traffic-pool service on cp1067 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is failed [15:43:40] fixing ^ [15:43:53] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2799355 (10chasemp) [15:43:56] 06Operations, 06Labs, 07Tracking: Performance test new secondary labstore HA cluster - https://phabricator.wikimedia.org/T146153#2799352 (10chasemp) 05Open>03Resolved a:03chasemp This work did not get persisted to the task here so I will attempt a brief outline for posterity. The main difficulty here... [15:44:22] RECOVERY - traffic-pool service on cp1067 is OK: OK - traffic-pool is active [15:44:32] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [15:44:50] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321889 (https://phabricator.wikimedia.org/T150518) (owner: 10Marostegui) [15:44:51] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2004220 (10chasemp) [15:44:54] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154#2799373 (10chasemp) 05Open>03Resolved a:03chasemp Some fallout here {T150829} and I'm looking at addressing an issue w/ wher... 
[15:45:06] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 07Wikimedia-Incident: Labstore2001 controller or shelf failure - https://phabricator.wikimedia.org/T102626#2799379 (10Papaul) 05Open>03Resolved Closing this since the system is back up online. [15:45:29] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321889 (https://phabricator.wikimedia.org/T150518) (owner: 10Marostegui) [15:45:55] (03PS4) 10Ottomata: Deploy EventStreams on scb and configure LVS service in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/320690 (https://phabricator.wikimedia.org/T143925) [15:46:43] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2799386 (10chasemp) [15:46:46] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154#2799384 (10chasemp) 05Resolved>03Open On second thought this should remain open until {T149946} is done (and reverted) [15:46:52] 06Operations, 10ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#2799387 (10Papaul) p:05Triage>03Normal [15:47:18] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2066 - T150518 (duration: 00m 49s) [15:47:30] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2004220 (10chasemp) [15:47:31] 06Operations, 06Labs, 13Patch-For-Review: Move maps share to labstore1003 - https://phabricator.wikimedia.org/T147657#2799389 (10chasemp) 05Open>03Resolved This is done and we need to find a new home for maps as we fixup labstore1001 but the scope of this task is itself completed [15:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:38] T150518: Import S5 to dbstore2001 and dbstore2002 + compression - 
https://phabricator.wikimedia.org/T150518 [15:48:00] (03CR) 10Ottomata: "I removed the cache misc puppetization from this change. I think the only remaining question is role params and hiera variable namespacin" [puppet] - 10https://gerrit.wikimedia.org/r/320690 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [15:48:45] (03PS1) 10Ori.livneh: Don't use AbuseFilterCachingParser on bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321890 (https://phabricator.wikimedia.org/T148660) [15:49:49] (03CR) 10Ori.livneh: [C: 032] Don't use AbuseFilterCachingParser on bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321890 (https://phabricator.wikimedia.org/T148660) (owner: 10Ori.livneh) [15:50:10] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2799399 (10chasemp) [15:50:12] 06Operations: change nfs-exports job to only run on changes to /etc/exports.d - https://phabricator.wikimedia.org/T126085#2799397 (10chasemp) 05Open>03declined I looked at this and am of the opinion currently that while it would be a slightly cleaner nicety we are doing better on the new setup. We can reeva... [15:53:40] (03PS2) 10Ori.livneh: Don't use AbuseFilterCachingParser on bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321890 (https://phabricator.wikimedia.org/T148660) [15:56:19] !log ori@tin Synchronized wmf-config/InitialiseSettings.php: I506f17f6: Don't use AbuseFilterCachingParser on bgwiki (T148660) (duration: 00m 49s) [15:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:40] T148660: Stack overflow in AbuseFilter when using AbuseFilterCachingParser - https://phabricator.wikimedia.org/T148660 [15:57:22] RECOVERY - NTP on cp2007 is OK: NTP OK: Offset -0.0004555284977 secs [15:58:32] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:58:52] PROBLEM - Apache HTTP on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:59:12] PROBLEM - HHVM rendering on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:59:32] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:59:47] "| Log: https://bit.ly.wikitech |" is nowhere [16:00:05] this very channel topic [16:00:33] that's a typo, should be https://bit.ly/wikitech [16:01:05] let's see if i have permissions to change it [16:01:52] or even maybe better to put https://tools.wmflabs.org/sal/production [16:02:21] the on-wiki SAL is the canonical version [16:02:40] (03PS3) 10Jcrespo: mariadb: Remove /root/.my.cnf from all servers [puppet] - 10https://gerrit.wikimedia.org/r/321888 (https://phabricator.wikimedia.org/T150446) [16:02:41] MatmaRex: given that you're at it, the ur1.ca link too seems to not work [16:02:58] grumble [16:03:13] !botbrain [16:03:18] grumble grumble [16:03:32] PROBLEM - puppet last run on rcs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:03:42] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [16:03:45] !log apt-get autoremove on analytics1028 [16:03:49] volans: it works for me, actually. but takes a long time to load. 
[16:04:02] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 72290 bytes in 0.174 second response time [16:04:02] MatmaRex yes though the tools version is linked to the wiki version [16:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:12] PROBLEM - Disk space on analytics1028 is CRITICAL: DISK CRITICAL - free space: /boot 7 MB (3% inode=99%) [16:04:25] and has searching [16:04:33] volans: i guess apache doesn't like serving directory listings with 1,781 files in them [16:04:42] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.022 second response time [16:05:12] RECOVERY - Disk space on analytics1028 is OK: DISK OK [16:05:34] yeah now it opened for me too :) maybe adding a symlink to today's date and linking that one works better, and also splitting them in different directory at least per year if not year-month [16:08:01] !log applying schema change on s7 (page) T69223 [16:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:22] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223 [16:11:32] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:11:50] (03CR) 10Marostegui: [C: 031] mariadb: Remove /root/.my.cnf from all servers [puppet] - 10https://gerrit.wikimedia.org/r/321888 (https://phabricator.wikimedia.org/T150446) (owner: 10Jcrespo) [16:12:32] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:13:08] (03PS8) 10Andrew Bogott: Explicitly set up /var/spool/gridengine on grid master [puppet] - 10https://gerrit.wikimedia.org/r/321584 [16:13:10] (03PS1) 10Andrew Bogott: Fork keystone policy so that horizon has its own keystone_policy.json [puppet] - 10https://gerrit.wikimedia.org/r/321891 [16:14:52] (03CR) 10Andrew Bogott: [C: 032] 
Fork keystone policy so that horizon has its own keystone_policy.json [puppet] - 10https://gerrit.wikimedia.org/r/321891 (owner: 10Andrew Bogott) [16:16:52] PROBLEM - puppet last run on db1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:17:15] (03CR) 10Hashar: [C: 04-1] "I guess we have the .htaccess in integration/docroot so they are kept in sync with the PHP entry points. That also let us tweak them dire" [puppet] - 10https://gerrit.wikimedia.org/r/321651 (https://phabricator.wikimedia.org/T150727) (owner: 10Hashar) [16:18:22] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:19:33] (03PS9) 10Andrew Bogott: Explicitly set up /var/spool/gridengine on grid master [puppet] - 10https://gerrit.wikimedia.org/r/321584 [16:19:35] (03PS1) 10Andrew Bogott: Horizon: Fix accidental double definition [puppet] - 10https://gerrit.wikimedia.org/r/321893 [16:22:09] (03CR) 10Andrew Bogott: [C: 032] Horizon: Fix accidental double definition [puppet] - 10https://gerrit.wikimedia.org/r/321893 (owner: 10Andrew Bogott) [16:22:32] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [16:26:04] (03PS10) 10Andrew Bogott: Explicitly set up /var/spool/gridengine on grid master [puppet] - 10https://gerrit.wikimedia.org/r/321584 [16:26:06] (03PS1) 10Andrew Bogott: horizon: Add a file I forgot in a previous patch. [puppet] - 10https://gerrit.wikimedia.org/r/321894 [16:26:19] (03PS4) 10Rush: labs: add ores_classification and ores_model tables [puppet] - 10https://gerrit.wikimedia.org/r/320804 (https://phabricator.wikimedia.org/T148561) (owner: 10Ladsgroup) [16:26:32] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [16:26:32] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/openstack-dashboard/keystone_policy.json] [16:28:32] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [16:29:00] (03CR) 10Rush: [C: 032 V: 032] labs: add ores_classification and ores_model tables [puppet] - 10https://gerrit.wikimedia.org/r/320804 (https://phabricator.wikimedia.org/T148561) (owner: 10Ladsgroup) [16:30:05] (03CR) 10Andrew Bogott: [C: 032] horizon: Add a file I forgot in a previous patch. [puppet] - 10https://gerrit.wikimedia.org/r/321894 (owner: 10Andrew Bogott) [16:30:09] (03PS2) 10Andrew Bogott: horizon: Add a file I forgot in a previous patch. [puppet] - 10https://gerrit.wikimedia.org/r/321894 [16:30:32] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10hardware-requests: codfw: 1 hardware access request for continuous integration - https://phabricator.wikimedia.org/T150865#2799514 (10hashar) [16:31:32] RECOVERY - puppet last run on rcs1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:31:42] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:32:31] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10hardware-requests: codfw: 1 hardware access request for continuous integration - https://phabricator.wikimedia.org/T150865#2799534 (10hashar) contint1001 is a rather large machine and I am not aware of what is available in... 
[16:33:42] PROBLEM - Apache HTTP on mw1099 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50431 bytes in 0.005 second response time [16:34:31] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2799548 (10Fjalapeno) I was reading over some of the Strawman API was wondering, is the response going to specify the file type? I couldn't quite tell... [16:34:32] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [16:34:42] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 200 OK - 438 bytes in 0.026 second response time [16:36:22] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:36:32] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [16:38:48] (03CR) 10Andrew Bogott: [C: 031] "This is a little bit confusing, but the logic of using an exec to verify presence of the mount seems reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/321786 (owner: 10Rush) [16:39:32] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:40:20] 06Operations, 10Monitoring, 10Traffic, 07HTTPS, 13Patch-For-Review: adjust ssl certificate montioring to differentiate between standard and LE certificates. 
- https://phabricator.wikimedia.org/T144293#2799556 (10AlexMonk-WMF) 05Open>03Resolved a:03AlexMonk-WMF [16:41:27] (03PS1) 10Reedy: Try again with GeoIP in AuthManagerLoginAuthenticateAudit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321895 [16:42:57] (03PS6) 10Rush: gridengine: refactor of init.pp for toollabs module [puppet] - 10https://gerrit.wikimedia.org/r/321786 [16:43:05] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2799572 (10Anomie) >>! In T66214#2799548, @Fjalapeno wrote: > I was reading over some of the Strawman API was wondering, is the response going to spec... [16:44:03] (03CR) 10Chad: [C: 031] "Harmless, worst case it returns an empty string. Deploy at will." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321895 (owner: 10Reedy) [16:44:52] RECOVERY - puppet last run on db1022 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [16:46:32] PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnish] [16:46:52] 06Operations, 10ops-eqiad, 10hardware-requests: Return wmf4747/wmf4748/wmf4749/wmf4750 to spares - https://phabricator.wikimedia.org/T146171#2799589 (10RobH) I'll step through the decommission steps shortly and ensure the systems are removed from everything up to the wipe step, then reassign this to Chris. 
[16:46:52] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp3045 is CRITICAL: connect to address 10.20.0.180 and port 3128: Connection refused [16:49:13] (03PS2) 10Reedy: Try again with GeoIP in AuthManagerLoginAuthenticateAudit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321895 [16:51:30] 06Operations, 10ops-codfw, 10DBA: db2035: RAID disk about to fail - https://phabricator.wikimedia.org/T150511#2799602 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Yo... [16:53:17] (03PS1) 10RobH: Return wmf4747/wmf4748/wmf4749/wmf4750 to spares [puppet] - 10https://gerrit.wikimedia.org/r/321896 [16:53:58] 06Operations, 06Labs, 10wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2799605 (10Krenair) @Shizhao, we don't use Extension:TwoFactorAuthentication for 2FA, we use Extension:OATHAuth. But either way I have no reason to believe this is a problem with the software itself. 
[16:54:05] (03CR) 10Reedy: [C: 032] Try again with GeoIP in AuthManagerLoginAuthenticateAudit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321895 (owner: 10Reedy) [16:54:42] (03Merged) 10jenkins-bot: Try again with GeoIP in AuthManagerLoginAuthenticateAudit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321895 (owner: 10Reedy) [16:56:52] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp3045 is OK: HTTP OK: HTTP/1.1 200 OK - 177 bytes in 0.167 second response time [16:57:01] (03PS1) 10RobH: Return wmf4747/wmf4748/wmf4749/wmf4750 to spares [dns] - 10https://gerrit.wikimedia.org/r/321897 [16:57:19] (03CR) 10RobH: [C: 032] Return wmf4747/wmf4748/wmf4749/wmf4750 to spares [puppet] - 10https://gerrit.wikimedia.org/r/321896 (owner: 10RobH) [16:57:43] (03PS2) 10Rush: Adding views for two PageAssessments tables for Labs [puppet] - 10https://gerrit.wikimedia.org/r/321845 (owner: 10Kaldari) [16:58:32] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:58:35] (03CR) 10RobH: [C: 032] Return wmf4747/wmf4748/wmf4749/wmf4750 to spares [dns] - 10https://gerrit.wikimedia.org/r/321897 (owner: 10RobH) [16:59:41] (03PS1) 10Reedy: Revert "Try again with GeoIP in AuthManagerLoginAuthenticateAudit" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321898 [16:59:45] (03CR) 10Reedy: [C: 032] Revert "Try again with GeoIP in AuthManagerLoginAuthenticateAudit" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321898 (owner: 10Reedy) [17:00:30] (03CR) 10Rush: [C: 032] gridengine: refactor of init.pp for toollabs module [puppet] - 10https://gerrit.wikimedia.org/r/321786 (owner: 10Rush) [17:00:36] (03PS7) 10Rush: gridengine: refactor of init.pp for toollabs module [puppet] - 10https://gerrit.wikimedia.org/r/321786 [17:00:39] (03CR) 10Rush: [V: 032] gridengine: refactor of init.pp for toollabs module [puppet] - 10https://gerrit.wikimedia.org/r/321786 (owner: 10Rush) [17:00:49] (03Merged) 
10jenkins-bot: Revert "Try again with GeoIP in AuthManagerLoginAuthenticateAudit" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321898 (owner: 10Reedy) [17:03:52] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 41 probes of 411 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [17:05:46] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Fix GeoIP (duration: 00m 49s) [17:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:23] (03PS2) 10Chad: Remove more ancient unreferenced fundraising cruft [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321733 [17:06:26] PROBLEM - Host db2049 is DOWN: PING CRITICAL - Packet loss = 100% [17:07:33] (03CR) 10Chad: [C: 032] Remove more ancient unreferenced fundraising cruft [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321733 (owner: 10Chad) [17:08:05] (03Merged) 10jenkins-bot: Remove more ancient unreferenced fundraising cruft [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321733 (owner: 10Chad) [17:08:18] 06Operations, 10Electron-PDFs, 10Security-Reviews, 06Services (blocked), 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2799670 (10JKatzWMF) [17:08:26] RECOVERY - Host db2049 is UP: PING OK - Packet loss = 0%, RTA = 36.34 ms [17:08:46] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 411 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [17:10:25] !log demon@tin Synchronized docroot/foundation/: rm more fundraising junks (duration: 00m 54s) [17:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:56] PROBLEM - MariaDB Slave SQL: s2 on db2049 is CRITICAL: CRITICAL slave_sql_state could not connect [17:11:06] PROBLEM - MariaDB Slave IO: s2 on db2049 is CRITICAL: CRITICAL slave_io_state could not connect [17:11:11] PROBLEM - mysqld processes on 
db2049 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [17:11:50] marostegui: is that you? ^^^ [17:13:22] !log reedy@tin Synchronized wmf-config/CommonSettings.php: change geoip name to stop upsetting ES (duration: 00m 48s) [17:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:08] someone run mysql stop [17:14:21] jynus: uptime 6 minute [17:14:26] *minutes [17:14:44] that explains the PROBLEM - Host db2049 is DOWN as well [17:14:47] so ... crash ? [17:15:13] reboot system boot 3.13.0-100-gener Wed Nov 16 17:07 - 17:14 (00:07) [17:15:17] no, reboot [17:15:55] last one to log in before reboot was moritzm [17:16:14] but 6h before for 10 minutes [17:16:16] nah, it was this morning [17:16:20] exactly [17:16:40] mgmt? [17:17:46] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag could not connect [17:17:57] papaul: do you know anything about it? [17:19:14] volans: no [17:19:41] 06Operations, 10OfflineContentGenerator, 10Reading-Community-Engagement, 06Reading-Web-Backlog, 06Services: [EPIC] Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#2799727 (10Aklapper) [17:20:11] 06Operations, 10ops-eqiad, 10hardware-requests: Return wmf4747/wmf4748/wmf4749/wmf4750 to spares - https://phabricator.wikimedia.org/T146171#2799731 (10RobH) a:05RobH>03Cmjohnson Ok, all decom steps on https://wikitech.wikimedia.org/wiki/Server_Lifecycle have been done, except the wiping of the disks. A... [17:21:20] "Power on request received by: Automatic Power Recovery." [17:21:28] Server power removed. [17:21:32] :| [17:21:33] Server reset. [17:21:47] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:21:53] so it was not a clean reboot, mysql didn't shut down properly I guess [17:22:24] IPMI/RMCP login by root [17:22:38] no, that is not today^ [17:22:42] severity=Critical [17:22:42] date=11/16/2016 [17:22:42] time=17:01 [17:22:42] description=Automatic Operating System Shutdown Initiated Due to Overheat Condition [17:22:56] ah, I didn't get that [17:23:06] show system1/log1/record7 [17:23:09] ok, let's file a task [17:23:14] can't say I am very happy with iLO logging [17:23:19] as in, I will file one [17:23:25] and fix/check tomorrow [17:23:27] never was... always searching where it has stuff hidden in [17:23:30] jynus: ok [17:23:42] description=System Overheating (Temperature Sensor 18, Location System, Temperature 127C) [17:23:59] whaaat? 127? [17:24:01] Critical Temperature Threshold Exceeded (Temperature Sensor 18, Location System, Temperature 127C) [17:24:06] and I was wondering where to boil my eggs [17:24:10] XD [17:24:21] papaul, are you sure Texas is not on fire? [17:24:43] check the PDU temp graphs :) [17:24:56] PROBLEM - cxserver endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:25:44] [Wed Nov 16 17:26:12 2016] CPU6: Package temperature above threshold, cpu clock throttled (total events = 374291520) [17:25:46] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [17:25:46] RECOVERY - cxserver endpoints health on scb1003 is OK: All endpoints are healthy [17:25:51] that's ^ scb1003 [17:26:11] not sure if it is related though to mobileapps complaining [17:27:03] 06Operations, 10OfflineContentGenerator, 10Reading-Community-Engagement, 06Reading-Web-Backlog, 06Services: [EPIC] Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#2799774 (10JKatzWMF) [17:27:30] I don't see any shutdown in mysql log, last modified a month ago Oct 21 09:37 db2049.err [17:27:59] 06Operations, 10OfflineContentGenerator, 10Reading-Community-Engagement, 06Reading-Web-Backlog, 06Services: Confirm attribution needs - https://phabricator.wikimedia.org/T150875#2799776 (10Aklapper) Hi @JKatzWMF. Please associate at least one [[ https://phabricator.wikimedia.org/project/query/G9vp6zKs.If... [17:28:17] I don't see anything weird in temperatures in codfw [17:28:18] 06Operations, 10ops-codfw, 10DBA: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2799779 (10jcrespo) [17:28:44] 06Operations, 10OfflineContentGenerator, 10Reading-Community-Engagement, 06Reading-Web-Backlog, 06Services: Collate wikimedia pages into a single html wikimedia page that can then be rendered into a single pdf - https://phabricator.wikimedia.org/T150874#2799794 (10Aklapper) [17:28:52] it could be the server or a sensor malfunctioning [17:29:03] 127 seems a bit unrealistic [17:29:12] akosiaris: maybe 127 was the return code :-P [17:29:25] 06Operations, 10ops-codfw, 10DBA: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2799779 (10Marostegui) I am checking the fans logs and they look fine. 
[17:29:27] it does say 127C [17:29:32] ^exactly [17:29:39] but note how it is 2^8-1 [17:29:42] 06Operations, 10OfflineContentGenerator, 10Reading-Community-Engagement, 06Reading-Web-Backlog, 06Services: Replace OCG in collection extension with Electron - https://phabricator.wikimedia.org/T150872#2799800 (10Aklapper) [17:29:48] er, 2^7-1 [17:30:03] check also db2048 and db2050 maybe [17:30:08] so if temperature is a signed int in a single byte [17:30:19] it may very well be indeed a sensor issue [17:30:37] 06Operations, 10OfflineContentGenerator, 10Reading-Community-Engagement, 06Reading-Web-Backlog, 06Services: [EPIC] Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#2799815 (10JKatzWMF) [17:30:41] 06Operations, 10Collection, 10OfflineContentGenerator, 10Reading-Community-Engagement, and 2 others: Replace OCG in collection extension with Electron - https://phabricator.wikimedia.org/T150872#2799695 (10Aklapper) [17:30:46] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 31 probes of 411 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [17:31:01] ^ ? [17:31:42] I've acked the alerts [17:32:07] we 've been having way too many temperature issues latety [17:32:08] will investigate tomorrow- you are free to do it, although I would like to keep mysql stopped for now [17:32:12] lately* [17:32:29] sure jynus , it might well need to be reimported [17:32:37] volans, not any more! 
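Aside on the 127C reading discussed above: 127 is 2^7-1, the largest value a signed 8-bit integer can hold, so a sensor register that stores temperature in one signed byte would report exactly 127 when it saturates. The sketch below is illustrative only. It assumes a one-byte signed register that clamps out-of-range values, which the log does not confirm for the iLO sensor on db2049, and the function names are hypothetical.

```python
import struct

INT8_MIN = -(2**7)      # -128
INT8_MAX = 2**7 - 1     # 127, the value seen in the iLO log

def saturate_int8(celsius):
    """Clamp a reading into the signed 8-bit range (assumed register behaviour)."""
    return max(INT8_MIN, min(INT8_MAX, int(celsius)))

def looks_saturated(reading):
    """Flag readings that sit exactly on the signed 8-bit ceiling."""
    return reading == INT8_MAX

# A plausible reading passes through unchanged; an out-of-range or
# garbage value pins to 127, which is why 127C hints at a sensor fault.
normal = saturate_int8(45.0)
pinned = saturate_int8(300.0)

# 127 is the largest value that fits in a signed byte: 0x7f.
raw = struct.pack("b", INT8_MAX)
```

A reading that lands exactly on the ceiling is not proof of a faulty sensor, but it is a cheap signal to cross-check against neighbouring hosts and PDU graphs, as was done in the channel.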
[17:33:00] we deployed transactional replication and I am 99% sure it works [17:33:00] with GTID I know :D but you might not trust the myisam tables :-P [17:33:03] volans: db2048 looks fine temperature-wise [17:33:17] but I do not want it up if it can crash again [17:34:04] disconnecting, this is not an emergency [17:34:14] o/ [17:35:02] akosiaris: regarding the RIPE alert, now looks ok there, 0 unreachable [17:35:22] but looked like a temporary network/routing issue [17:35:46] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 411 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [17:35:54] kind of weird that it coincided with db2049 alerting [17:37:12] it is however in a different row in a different rack [17:37:21] where do we have the RIPE probe? [17:37:27] yeah I was about to check that [17:37:28] a1 [17:37:41] I didn't make any kernel changes to 2049, I just logged in to check some rdep IIRC [17:37:44] db2049 is c6 [17:37:52] so nothing to do with each other [17:39:06] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3039_v4, cp3039_v6 [17:39:16] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3039_v4, cp3039_v6 [17:39:16] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3039_v4, cp3039_v6 [17:39:16] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3039_v4, cp3039_v6 [17:39:16] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3039_v4, cp3039_v6 [17:39:16] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3039_v4, cp3039_v6 [17:39:16] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3039_v4, cp3039_v6 [17:39:26] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3039_v4, cp3039_v6 [17:39:26] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: 
cp3039_v4, cp3039_v6 [17:39:26] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3039_v4, cp3039_v6 [17:39:26] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3039_v4, cp3039_v6 [17:39:46] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3039_v4, cp3039_v6 [17:39:46] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3039_v4, cp3039_v6 [17:39:46] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3039_v4, cp3039_v6 [17:39:46] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3039_v4, cp3039_v6 [17:39:46] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3039_v4, cp3039_v6 [17:39:52] ^ cp3039 is currently rebooting, sorry for the noise [17:39:56] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3039_v4, cp3039_v6 [17:39:56] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3039_v4, cp3039_v6 [17:39:56] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3039_v4, cp3039_v6 [17:39:56] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3039_v4, cp3039_v6 [17:39:56] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3039_v4, cp3039_v6 [17:40:16] PROBLEM - Host cp3039 is DOWN: PING CRITICAL - Packet loss = 100% [17:40:46] (03PS2) 10Filippo Giunchedi: role: add prometheus 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/321814 (https://phabricator.wikimedia.org/T150486) [17:41:50] (03PS1) 10Muehlenhoff: Drop poolcounter role from helium [puppet] - 10https://gerrit.wikimedia.org/r/321902 [17:41:54] (03CR) 10jenkins-bot: [V: 04-1] role: add prometheus 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/321814 (https://phabricator.wikimedia.org/T150486) (owner: 10Filippo Giunchedi) [17:44:22] mmh 
cp3039 didn't come back up online yet and I don't seem to be able to connect to cp3039.mgmt.esams.wmnet [17:44:34] could anybody else try? [17:45:36] akosiaris are the temperature alert related to scb1003 alerts? [17:46:35] ema: having a look [17:46:39] thanks [17:47:04] bearND: can't say for sure [17:47:42] maybe, but I am starting to doubt it. It has many log entries for today and no other alerts (yet) [17:47:54] ema: yup, can connect but no ssh banner on mgmt [17:49:10] (03PS5) 10Dereckson: Switch MobileFrontend to extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314748 (https://phabricator.wikimedia.org/T147092) [17:49:17] I can't connect to either, mgmt and regular host are not responding [17:49:17] moritzm, godog: thanks for double-checking [17:49:36] should I open a ops-esams ticket or do you guys have other suggestions? [17:49:50] yeah, ops-esams it is [17:49:56] alright then [17:50:09] (03CR) 10Dereckson: "PS5: Rebased" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314748 (https://phabricator.wikimedia.org/T147092) (owner: 10Dereckson) [17:50:31] bearND: scratch that, it actually has alerts, they just were SOFT state and never showed up. [17:50:58] akosiaris: where can I see them? 
[17:51:24] (03PS3) 10Filippo Giunchedi: role: add prometheus 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/321814 (https://phabricator.wikimedia.org/T150486) [17:53:17] (03CR) 10jenkins-bot: [V: 04-1] role: add prometheus 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/321814 (https://phabricator.wikimedia.org/T150486) (owner: 10Filippo Giunchedi) [17:53:34] * godog shakes fist [17:53:46] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 70 ESP OK [17:53:46] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 70 ESP OK [17:53:46] RECOVERY - Host cp3039 is UP: PING OK - Packet loss = 0%, RTA = 83.74 ms [17:53:46] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 70 ESP OK [17:53:46] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 70 ESP OK [17:53:47] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 70 ESP OK [17:53:56] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [17:53:56] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [17:53:56] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 70 ESP OK [17:53:56] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK [17:53:56] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 70 ESP OK [17:53:57] bearND: icinga.wikimedia.org [17:54:06] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 70 ESP OK [17:54:15] bearND: e.g. 
https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=scb1004&service=mobileapps+endpoints+health [17:54:16] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [17:54:16] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 70 ESP OK [17:54:16] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK [17:54:16] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [17:54:17] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK [17:54:17] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK [17:54:26] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [17:54:26] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [17:54:26] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [17:54:26] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 70 ESP OK [17:54:58] uh and now cp3039 is back online? [17:55:05] scb1004 has complained about temperature issues as well btw [17:55:18] I 'll lower the weight for mobileapps on these 2 hosts [17:55:22] bearND: ^ [17:55:58] akosiaris: ok, thank you. [17:56:17] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mobileapps']) [17:56:22] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mobileapps']) [17:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:49] This is not off topic as wikimedia uses linux, microsoft joins the linux foundation http://arstechnica.com/information-technology/2016/11/microsoft-yes-microsoft-joins-the-linux-foundation :) [17:56:51] akosiaris: what's the weight on the other two machines, so i can compare? [17:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:56] !lower the mobileapps weight for scb1003, scb1004. 
There seem to be temperature issues with those hosts; lowering the load might help [17:56:57] mutante ^^ [17:57:10] bearND: https://config-master.wikimedia.org/conftool/eqiad/mobileapps [17:58:25] ema: maybe it just went into a very long hardware self-test or whatever [17:58:44] akosiaris: thanks. It's weird that 1003 and 1004 have these issues. I was thinking they should have been doing better since they have twice the amount of memory of the first two [17:58:48] like testing all of the installed memory byte by byte :P [17:58:52] It is what it is [17:59:04] memory is irrelevant in this case I think [17:59:10] moritzm: does the management interface also stop accepting ssh connections in that case? [17:59:15] it's the CPU that's overheating [17:59:45] maybe we should try to apply that thermal paste we've applied to other hosts [18:00:13] it would not be the first time we've had problems due to inadequate thermal paste [18:01:29] (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM overall, some more comments" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055) (owner: 1020after4) [18:01:53] ema: I have no idea! [18:04:58] (03CR) 10Alexandros Kosiaris: [C: 031] Drop poolcounter role from helium [puppet] - 10https://gerrit.wikimedia.org/r/321902 (owner: 10Muehlenhoff) [18:06:36] bearND: I've filed https://phabricator.wikimedia.org/T150882 [18:07:06] (03PS4) 10Filippo Giunchedi: role: add prometheus 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/321814 (https://phabricator.wikimedia.org/T150486) [18:07:45] PROBLEM - NTP on cp3039 is CRITICAL: NTP CRITICAL: Offset unknown [18:08:12] (03CR) 10Filippo Giunchedi: [C: 031] add additional information on malformed responses [software/service-checker] - 10https://gerrit.wikimedia.org/r/321714 (https://phabricator.wikimedia.org/T150560) (owner: 10Volans) [18:09:25] ema: in theory should not AFAIK [18:12:53] akosiaris: thanks. 
Are the CPUs in scb1003 and 1004 comparable to scb1001 + 1002? [18:14:51] (03CR) 10Andrew Bogott: [C: 032] Add $managed flag to mariadb::service [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/318859 (owner: 10Andrew Bogott) [18:16:02] bearND: they are not the same if that's what you mean. But scb1003 and scb1004 have more and more powerful [18:16:49] it's Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz (scb1003,4) vs Intel(R) Xeon(R) CPU E5-2440 0 @ 2.40GHz (scb1001,2) [18:16:51] akosiaris: oh, wow. Yeah, I hope that paste will help [18:17:10] so do I [18:17:21] it's either that or a bug in the firmware of the motherboard [18:23:11] (03CR) 10Filippo Giunchedi: Enable multiple config files in phabricator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055) (owner: 1020after4) [18:24:10] relocating, brb [18:25:11] (03CR) 10Krinkle: "Core patch merged." [puppet] - 10https://gerrit.wikimedia.org/r/320928 (https://phabricator.wikimedia.org/T150124) (owner: 10Aaron Schulz) [18:27:01] (03CR) 10Alexandros Kosiaris: "> No no, it just means that at any given time, a single broker is responsible for a particular topic. 
If that broker goes down, another o" [puppet] - 10https://gerrit.wikimedia.org/r/320690 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [18:29:52] (03CR) 10Alexandros Kosiaris: [C: 04-1] "minor inline comment" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/320690 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [18:34:21] (03PS1) 10Krinkle: StartProfile: Add try/catch around Xhgui->save() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321907 [18:36:30] (03PS11) 1020after4: Enable multiple config files in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055) [18:37:48] RECOVERY - NTP on cp3039 is OK: NTP OK: Offset 0.0005393922329 secs [18:46:06] (03CR) 1020after4: Enable multiple config files in phabricator (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055) (owner: 1020after4) [18:47:50] (03CR) 1020after4: Enable multiple config files in phabricator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055) (owner: 1020after4) [18:48:28] (03CR) 10Aaron Schulz: [C: 031] StartProfile: Add try/catch around Xhgui->save() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321907 (owner: 10Krinkle) [18:48:54] (03CR) 1020after4: [C: 031] Standardize most of the docroots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321726 (owner: 10Chad) [18:49:40] (03CR) 1020after4: [C: 031] Kill skins-1.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321730 (owner: 10Chad) [18:55:35] (03CR) 10Krinkle: "Actually, I can't find any record of such an exception anywhere in logstash. Might not be worth changing the code for." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321907 (owner: 10Krinkle) [18:56:58] legoktm wikibugs seems to have quit, could you reconnect it please? 
[19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161116T1900). [19:00:04] (03PS1) 10Dzahn: icinga/cirrus: lower disk space crit threshold to 12% [puppet] - 10https://gerrit.wikimedia.org/r/321913 [19:00:04] James_F: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [19:00:13] * James_F waves. [19:00:24] I can SWAT today [19:00:39] Just a trivial config patch of Roan's that I wanted to expedite. [19:01:01] thcipriani: I've got some more nasty ones we could try and see if it breaks the world :p [19:01:12] (03PS2) 10Filippo Giunchedi: graphite: avoid spikes in mw error rate alert [puppet] - 10https://gerrit.wikimedia.org/r/321577 [19:01:20] ostriches: of course you do :) [19:01:36] (03PS3) 10Thcipriani: Fix notification icon path for foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319967 (owner: 10Catrope) [19:02:09] (03CR) 10Krinkle: [C: 04-1] "Not dead enough imho:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321730 (owner: 10Chad) [19:02:39] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319967 (owner: 10Catrope) [19:02:41] (03PS2) 10Dzahn: icinga/cirrus: lower disk space crit threshold to 12% [puppet] - 10https://gerrit.wikimedia.org/r/321913 (https://phabricator.wikimedia.org/T130329) [19:03:05] Krinkle: I CAN'T BE RESPONSIBLE FOR PEOPLE'S STUPID ONWIKI JS/CSS :p [19:03:15] (03Merged) 10jenkins-bot: Fix notification icon path for foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319967 (owner: 10Catrope) [19:03:34] ostriches: Then mark it "officially" deprecated and announce on tech-ambassador. 
Give it a couple of weeks and then it's fine to go :) [19:03:46] I know it's not an official API in the first place, but legacy. [19:03:51] (03CR) 10Filippo Giunchedi: Enable multiple config files in phabricator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055) (owner: 1020after4) [19:04:03] I can probably fix them in an hour or two with tourbot [19:04:07] James_F: live on mw1099, check please [19:05:08] thcipriani: It's not really checkable. [19:05:25] thcipriani: I mean, the currently-configured URL 404s and the new one doesn't. [19:05:55] (03CR) 10Filippo Giunchedi: [C: 032] graphite: avoid spikes in mw error rate alert [puppet] - 10https://gerrit.wikimedia.org/r/321577 (owner: 10Filippo Giunchedi) [19:05:56] :) [19:05:59] But the icon's only used for system messages on wmfwiki, which I can't trigger. [19:06:37] following procedure is all, etc. Will sync live. [19:07:04] (03CR) 10Dzahn: [C: 032] icinga/cirrus: lower disk space crit threshold to 12% [puppet] - 10https://gerrit.wikimedia.org/r/321913 (https://phabricator.wikimedia.org/T130329) (owner: 10Dzahn) [19:07:09] (03PS3) 10Dzahn: icinga/cirrus: lower disk space crit threshold to 12% [puppet] - 10https://gerrit.wikimedia.org/r/321913 (https://phabricator.wikimedia.org/T130329) [19:07:15] Thanks. 
[19:07:16] Krinkle: #til I'm not subscribed to tech-ambassador and mostly forgot that was a thing :p [19:07:38] (03PS2) 10Filippo Giunchedi: role: add external_labels to ops prometheus [puppet] - 10https://gerrit.wikimedia.org/r/321813 (https://phabricator.wikimedia.org/T150486) [19:07:55] (03CR) 10Dzahn: [V: 032] icinga/cirrus: lower disk space crit threshold to 12% [puppet] - 10https://gerrit.wikimedia.org/r/321913 (https://phabricator.wikimedia.org/T130329) (owner: 10Dzahn) [19:08:29] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:319967|Fix notification icon path for foundationwiki]] (duration: 00m 49s) [19:08:37] ^ James_F live everywhere [19:08:43] * James_F re-checks. [19:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:11] Yeah, everything looks OK. [19:09:16] Thank you. [19:09:53] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [19:10:08] (03CR) 10Krinkle: Standardize most of the docroots (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321726 (owner: 10Chad) [19:10:13] James_F: thanks for checking :) [19:10:31] Krinkle: So what's the official replacement? Using static.php? 
[19:10:43] ostriches: Using /w/skins/* [19:10:51] Like plain stock mediawiki [19:10:53] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3268852 keys, up 16 days 10 hours - replication_delay is 0 [19:10:56] Which works now [19:11:03] it's rewritten via multiversion static.php indeed [19:14:33] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [19:15:32] (03CR) 10Chad: Standardize most of the docroots (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321726 (owner: 10Chad) [19:15:33] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [19:15:52] (03CR) 10Krinkle: [C: 031] MW apache: remove bits.wm.o vhost [puppet] - 10https://gerrit.wikimedia.org/r/305536 (https://phabricator.wikimedia.org/T107430) (owner: 10BBlack) [19:18:41] mh I suspect the last mw error about fatals was before https://gerrit.wikimedia.org/r/#/c/321577 applied [19:19:14] (03CR) 10Filippo Giunchedi: [C: 032] role: add external_labels to ops prometheus [puppet] - 10https://gerrit.wikimedia.org/r/321813 (https://phabricator.wikimedia.org/T150486) (owner: 10Filippo Giunchedi) [19:19:18] (03PS3) 10Filippo Giunchedi: role: add external_labels to ops prometheus [puppet] - 10https://gerrit.wikimedia.org/r/321813 (https://phabricator.wikimedia.org/T150486) [19:19:33] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [19:22:33] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [19:24:43] Ah, gotcha :) [19:24:46] Krinkle: ^ [19:26:42] 06Operations, 10OfflineContentGenerator, 10Reading-Community-Engagement, 06Reading-Web-Backlog, 06Services: [EPIC] Replicate core OCG features and sunset OCG service 
- https://phabricator.wikimedia.org/T150871#2800307 (10JKatzWMF) @GWicke Thanks! Other corrections and input would be very welcome and t... [19:27:27] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [19:27:28] 06Operations, 06Performance-Team, 10Thumbor: Investigate whether we need a repeat failure guard and/or a poolcounter-like behavior in Thumbor - https://phabricator.wikimedia.org/T150745#2800312 (10fgiunchedi) I think it'd make sense to have similar rate-limit capabilities to avoid overload. Implementation-wi... [19:27:55] (03CR) 10Krinkle: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309363 (owner: 10Chad) [19:28:32] (03CR) 10Krinkle: [C: 031] Remove bits docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317657 (owner: 10Chad) [19:29:22] (03PS1) 10Chad: Commons/Usability docroots: Use wikimedia.org standard docroot [puppet] - 10https://gerrit.wikimedia.org/r/321916 [19:29:27] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [19:35:40] (03PS2) 10Chad: Standardize most of the docroots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321726 [19:36:14] Krinkle: PS2 skips commons too, pending the puppet change, at which point I'll outright nuke them [19:36:27] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0] [19:37:55] (03PS1) 10Chad: Docroots: Remove commons and usability docroots, they use wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321919 [19:38:41] 06Operations, 10OfflineContentGenerator, 10Reading-Community-Engagement, 06Reading-Web-Backlog, 06Services: [EPIC] Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#2800338 (10cscott) Why "Identify and communicate sunsetting" a mere month before the 
service is... [19:38:57] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [19:39:56] (03CR) 10Alex Monk: [C: 04-1] Commons/Usability docroots: Use wikimedia.org standard docroot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/321916 (owner: 10Chad) [19:39:58] mh that elevated mw exceptions seems to be true, https://graphite.wikimedia.org/render/?width=719&height=328&_salt=1479325152.386&target=logstash.rate.mediawiki.fatal.ERROR.sum&target=logstash.rate.mediawiki.exception.ERROR.sum&from=-2hours [19:40:27] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [19:40:46] (03CR) 10Chad: "Ahhh, typos galore! amending..." [puppet] - 10https://gerrit.wikimedia.org/r/321916 (owner: 10Chad) [19:43:27] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [19:43:32] (03PS2) 10Chad: Commons/Usability docroots: Use wikimedia.org standard docroot [puppet] - 10https://gerrit.wikimedia.org/r/321916 [19:44:08] godog: db replication? https://grafana.wikimedia.org/dashboard/db/production-logging?from=now-1h&to=now-3m [19:44:42] (03PS3) 10Chad: Commons/Usability docroots: Use wikimedia.org standard docroot [puppet] - 10https://gerrit.wikimedia.org/r/321916 [19:46:11] elukey: could be! still checking [19:46:18] (03CR) 10Andrew Bogott: "Would it work to remove the pinning and instead include openstack::repo in a role for this host someplace? That would get you client vers" [puppet] - 10https://gerrit.wikimedia.org/r/306220 (https://phabricator.wikimedia.org/T137217) (owner: 10Hashar) [19:46:35] godog, elukey: logstash concurs. At least, that's what MW is thinking and complaining about. [19:47:37] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:48:05] yup, looks like it started at 18:55 from logstash "mediawiki-errors" [19:50:27] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [19:50:47] mhh "Server db1034 (#2) is not replicating?" though that shouldn't result in a fatal afaik [19:52:05] might still be throwing an exception {"id":"88e02416a143b7688fdd3515","type":"DBExpectedError","file":"/srv/mediawiki/php-1.29.0-wmf.2/includes/libs/rdbms/database/DatabaseMysqlBase.php","line":814,"message":"Failed to query MASTER_POS_WAIT()","code":0,"url":"/rpc/RunJobs.php?wiki=svwiki&typ [19:52:57] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [19:53:33] 06Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#2800351 (10Dzahn) I tested what happens if you send mail to the group address from an external non-WMF domain, and i got this auto reply: ``` We're writing to let you know that the group you tried to contact (ops-mainten... [19:55:25] but definitely db1034 has been suffering a bit lately with on and off lag [19:55:31] lately == in the last hour [19:55:59] (03CR) 10Andrew Bogott: [C: 031] "This all looks reasonable to me. If you've already tested it a lot then we can talk about deploying, otherwise we should try testing it o" [puppet] - 10https://gerrit.wikimedia.org/r/309216 (https://phabricator.wikimedia.org/T144955) (owner: 10BryanDavis) [19:59:58] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [20:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161116T2000). Please do the needful. 
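[editor's note] The "MediaWiki exceptions and fatals per minute" alerts above flap between "20.00% of data above the critical threshold [50.0]" and "Less than 1.00% above the threshold [25.0]". A minimal sketch of how a percentage-over-threshold check of that shape could be computed follows. It is not the actual Icinga/Graphite check used here: the function name check_threshold, the percent_limit parameter, and the exact comparison rules are assumptions made only to illustrate why a single spike in a short window is enough to trip CRITICAL.

```python
def check_threshold(datapoints, warning, critical, percent_limit=1.0):
    """Hypothetical percentage-over-threshold check (illustrative only).

    datapoints: list of numbers (None entries are ignored as missing data).
    warning / critical: value thresholds, e.g. 25.0 and 50.0.
    percent_limit: assumed share of datapoints (in percent) that must
    exceed a threshold before the corresponding state is raised.
    """
    points = [p for p in datapoints if p is not None]
    if not points:
        return "UNKNOWN", "No valid datapoints found"

    def percent_above(threshold):
        over = sum(1 for p in points if p > threshold)
        return 100.0 * over / len(points)

    crit_pct = percent_above(critical)
    if crit_pct >= percent_limit:
        return "CRITICAL", (
            f"{crit_pct:.2f}% of data above the critical threshold [{critical}]"
        )

    warn_pct = percent_above(warning)
    if warn_pct >= percent_limit:
        return "WARNING", (
            f"{warn_pct:.2f}% of data above the warning threshold [{warning}]"
        )

    return "OK", f"Less than {percent_limit:.2f}% above the threshold [{warning}]"


# One spike out of five one-minute samples is already 20% of the window.
state, message = check_threshold([10, 12, 60, 8, 9], warning=25.0, critical=50.0)
print(state, "-", message)
```

With only a handful of samples in the window, one datapoint above 50 yields 20%, which would explain the repeated PROBLEM/RECOVERY pairs; the patch "graphite: avoid spikes in mw error rate alert" mentioned above appears to target that kind of flapping.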
[20:01:32] following up on -databases [20:03:58] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [20:06:58] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [20:09:21] afaict it is mostly T147648 [20:09:21] T147648: Unexplained increase in thumbnail 500s - https://phabricator.wikimedia.org/T147648 [20:10:15] 06Operations, 10Traffic, 10media-storage: Unexplained increase in thumbnail 500s - https://phabricator.wikimedia.org/T147648#2800392 (10fgiunchedi) @JoeWalsh is there a timeline for 5.3.0 ? We're still seeing significant traffic for 0px requests [20:10:32] (03PS2) 10Dduvall: docker: apt repo before installing package [puppet] - 10https://gerrit.wikimedia.org/r/321485 [20:10:34] (03PS3) 10Dduvall: [WIP] contint: New role for Docker based CI slave [puppet] - 10https://gerrit.wikimedia.org/r/320942 (https://phabricator.wikimedia.org/T150502) [20:10:58] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:13:55] (03PS1) 10EBernhardson: Increase cirrus interwiki loadtest to 50% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321925 (https://phabricator.wikimedia.org/T149740) [20:15:01] (03PS1) 10Reedy: Log successful login attempts for a while [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321926 (https://phabricator.wikimedia.org/T150554) [20:17:38] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [20:18:17] (03PS1) 10Filippo Giunchedi: templates: add cassandra instances for restbase201[012] [dns] - 10https://gerrit.wikimedia.org/r/321927 (https://phabricator.wikimedia.org/T150680) [20:19:39] (03PS6) 10Yuvipanda: base: Move package list to hiera [puppet] - 10https://gerrit.wikimedia.org/r/321495 [20:20:02] (03CR) 10Yuvipanda: [C: 032 V: 032] "I renamed 
the YAML file :) If this file gets renamed, please do poke me." [puppet] - 10https://gerrit.wikimedia.org/r/321495 (owner: 10Yuvipanda) [20:20:37] (03PS4) 10Andrew Bogott: dns-floating-ip-updater: use python's ipaddress class to determine PTR FQDNs for IPs [puppet] - 10https://gerrit.wikimedia.org/r/309708 (owner: 10Alex Monk) [20:21:37] (03CR) 10Filippo Giunchedi: [C: 032] templates: add cassandra instances for restbase201[012] [dns] - 10https://gerrit.wikimedia.org/r/321927 (https://phabricator.wikimedia.org/T150680) (owner: 10Filippo Giunchedi) [20:22:18] (03CR) 10Reedy: [C: 032] Log successful login attempts for a while [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321926 (https://phabricator.wikimedia.org/T150554) (owner: 10Reedy) [20:22:29] puppet storm incoming is me [20:22:34] (03CR) 10Reedy: [C: 04-2] "Blocking for a moment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321926 (https://phabricator.wikimedia.org/T150554) (owner: 10Reedy) [20:22:38] (03PS1) 10Yuvipanda: base: Actually move package file [puppet] - 10https://gerrit.wikimedia.org/r/321929 [20:22:38] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:22:48] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:22:55] (03PS2) 10Yuvipanda: base: Actually move package file [puppet] - 10https://gerrit.wikimedia.org/r/321929 [20:22:58] PROBLEM - puppet last run on wtp1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:22:58] PROBLEM - puppet last run on mw2167 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:22:59] PROBLEM - puppet last run on uranium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:22:59] PROBLEM - puppet last run on aluminium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:22:59] PROBLEM - puppet last run on zosma is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:07] (03CR) 10Yuvipanda: [C: 032 V: 032] base: Actually move package file [puppet] - 10https://gerrit.wikimedia.org/r/321929 (owner: 10Yuvipanda) [20:23:08] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:08] PROBLEM - puppet last run on ms-be1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:08] PROBLEM - puppet last run on db1093 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:18] PROBLEM - puppet last run on mc1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:20] it'll recover shortly [20:23:22] sorry about that [20:23:28] PROBLEM - puppet last run on db1063 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:28] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:28] PROBLEM - puppet last run on ms-fe1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:28] PROBLEM - puppet last run on mw1200 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:28] PROBLEM - puppet last run on aqs1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:23:36] (03PS2) 10Andrew Bogott: Remove wikitech references from ldapconfig [puppet] - 10https://gerrit.wikimedia.org/r/309705 (owner: 10Alex Monk) [20:23:38] PROBLEM - puppet last run on db1074 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:38] PROBLEM - puppet last run on mc1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:38] PROBLEM - puppet last run on ms-be2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:38] PROBLEM - puppet last run on dbproxy1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:38] PROBLEM - puppet last run on dbproxy1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:38] PROBLEM - puppet last run on mw2231 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:39] PROBLEM - puppet last run on mw2186 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:39] PROBLEM - puppet last run on mw2144 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:40] PROBLEM - puppet last run on mc1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:48] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:48] PROBLEM - puppet last run on ms-be2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:48] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:23:48] PROBLEM - puppet last run on ms-be1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:48] PROBLEM - puppet last run on elastic2020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:48] PROBLEM - puppet last run on cp2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:49] PROBLEM - puppet last run on lvs2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:49] PROBLEM - puppet last run on elastic2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:50] PROBLEM - puppet last run on mc1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:51] PROBLEM - puppet last run on mw1297 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:51] PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:57] o_O [20:23:58] PROBLEM - puppet last run on db1070 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:58] PROBLEM - puppet last run on poolcounter1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:58] PROBLEM - puppet last run on mw1185 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:58] PROBLEM - puppet last run on lvs4002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:59] PROBLEM - puppet last run on db1053 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:23:59] PROBLEM - puppet last run on labstore2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:59] PROBLEM - puppet last run on db2050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:08] PROBLEM - puppet last run on pc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:08] PROBLEM - puppet last run on mw2209 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:18] PROBLEM - puppet last run on wdqs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:28] PROBLEM - puppet last run on elastic1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:28] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:28] PROBLEM - puppet last run on mw1270 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:28] PROBLEM - puppet last run on mw2219 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:38] PROBLEM - puppet last run on wtp1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:38] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:38] PROBLEM - puppet last run on pc2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:38] PROBLEM - puppet last run on es2016 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:24:38] PROBLEM - puppet last run on labstore2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:39] PROBLEM - puppet last run on db2011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:39] PROBLEM - puppet last run on mc1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:48] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:48] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:49] PROBLEM - puppet last run on mw2091 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:49] PROBLEM - puppet last run on db2067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:49] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:49] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:55] (03PS3) 10Andrew Bogott: Remove wikitech references from ldapconfig [puppet] - 10https://gerrit.wikimedia.org/r/309705 (owner: 10Alex Monk) [20:24:58] PROBLEM - puppet last run on relforge1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:58] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:58] PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:24:58] PROBLEM - puppet last run on elastic2021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:58] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:58] PROBLEM - puppet last run on mw2118 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:59] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:59] PROBLEM - puppet last run on kafka1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:00] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:08] (03PS1) 10Yuvipanda: base: Remove some duplicate package installs [puppet] - 10https://gerrit.wikimedia.org/r/321930 [20:25:08] PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:08] PROBLEM - puppet last run on mw1174 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:08] PROBLEM - puppet last run on mc2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:08] PROBLEM - puppet last run on ganeti2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:08] PROBLEM - puppet last run on ganeti2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:08] PROBLEM - puppet last run on mw2157 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:25:09] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:10] PROBLEM - puppet last run on mw2248 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:18] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:18] PROBLEM - puppet last run on ms-be1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:18] PROBLEM - puppet last run on ms-be1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:28] PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:28] PROBLEM - puppet last run on dbstore1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:28] PROBLEM - puppet last run on elastic1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:29] PROBLEM - puppet last run on wtp1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:29] PROBLEM - puppet last run on mw1219 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:38] (03PS2) 10Yuvipanda: base: Remove some duplicate package installs [puppet] - 10https://gerrit.wikimedia.org/r/321930 [20:25:38] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:38] PROBLEM - puppet last run on db1011 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:25:38] PROBLEM - puppet last run on es2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:38] PROBLEM - puppet last run on sarin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:38] PROBLEM - puppet last run on mw2141 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:46] (03CR) 10Yuvipanda: [C: 032 V: 032] base: Remove some duplicate package installs [puppet] - 10https://gerrit.wikimedia.org/r/321930 (owner: 10Yuvipanda) [20:25:48] PROBLEM - puppet last run on mc1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:48] PROBLEM - puppet last run on es2014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:48] PROBLEM - puppet last run on wtp1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:48] PROBLEM - puppet last run on mw1203 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:48] PROBLEM - puppet last run on graphite1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:49] PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:49] PROBLEM - puppet last run on mw1226 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:50] PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:50] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:25:50] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:58] PROBLEM - puppet last run on mc1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:58] PROBLEM - puppet last run on mc1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:58] PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:59] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:59] PROBLEM - puppet last run on wtp1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:59] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:59] PROBLEM - puppet last run on restbase1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:25:59] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:00] PROBLEM - puppet last run on prometheus2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:00] PROBLEM - puppet last run on mw2217 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:01] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:02] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:26:18] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:18] PROBLEM - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:18] PROBLEM - puppet last run on elastic2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:18] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:18] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:28] PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:28] PROBLEM - puppet last run on db1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:31] (03CR) 10Andrew Bogott: [C: 032] Remove wikitech references from ldapconfig [puppet] - 10https://gerrit.wikimedia.org/r/309705 (owner: 10Alex Monk) [20:26:35] (03PS4) 10Andrew Bogott: Remove wikitech references from ldapconfig [puppet] - 10https://gerrit.wikimedia.org/r/309705 (owner: 10Alex Monk) [20:26:38] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:48] PROBLEM - puppet last run on db1086 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:48] PROBLEM - puppet last run on fermium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:48] PROBLEM - puppet last run on dubnium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:26:48] PROBLEM - puppet last run on elastic2014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:48] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:48] PROBLEM - puppet last run on pybal-test2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:49] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:49] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:50] PROBLEM - puppet last run on db2064 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:51] PROBLEM - puppet last run on maps2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:51] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:52] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:58] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:58] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:58] PROBLEM - puppet last run on mw1228 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:58] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:26:58] PROBLEM - puppet last run on mc2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:58] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:59] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:59] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:00] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:00] PROBLEM - puppet last run on mc2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:14] (03PS1) 10Yuvipanda: base: Remove more duplicates [puppet] - 10https://gerrit.wikimedia.org/r/321932 [20:27:20] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:20] PROBLEM - puppet last run on dbproxy1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:20] PROBLEM - puppet last run on maps1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:20] PROBLEM - puppet last run on mc1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:20] PROBLEM - puppet last run on etcd1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:28] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:28] PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:27:28] PROBLEM - puppet last run on mw1276 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:35] (03CR) 10jenkins-bot: [V: 04-1] base: Remove more duplicates [puppet] - 10https://gerrit.wikimedia.org/r/321932 (owner: 10Yuvipanda) [20:27:38] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:38] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:38] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:38] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:44] (03PS2) 10Yuvipanda: base: Remove more duplicates [puppet] - 10https://gerrit.wikimedia.org/r/321932 [20:27:48] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:48] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:48] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:48] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:48] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:49] PROBLEM - puppet last run on elastic2007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:49] PROBLEM - puppet last run on mw1284 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:27:49] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:50] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:50] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:58] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:58] PROBLEM - puppet last run on mc1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:58] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:58] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:58] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:58] PROBLEM - puppet last run on mw2250 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:59] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:00] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:00] PROBLEM - puppet last run on mw2228 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:28:00] (03CR) 10Yuvipanda: [C: 032 V: 032] base: Remove more duplicates [puppet] - 10https://gerrit.wikimedia.org/r/321932 (owner: 10Yuvipanda) [20:28:01] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:01] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:01] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:02] (03CR) 10Alex Monk: "Thanks! I've started beta actually using this because no one could find the root mysql password (I made a note in the project SAL too)" [puppet] - 10https://gerrit.wikimedia.org/r/321878 (https://phabricator.wikimedia.org/T150446) (owner: 10Jcrespo) [20:28:08] PROBLEM - puppet last run on chromium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:08] PROBLEM - puppet last run on mc1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:08] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:08] PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:08] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:09] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:09] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:28:10] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:18] PROBLEM - puppet last run on db1095 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:18] PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:28] PROBLEM - puppet last run on hydrogen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:28] PROBLEM - puppet last run on wtp1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:38] PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:38] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:38] PROBLEM - puppet last run on mw2095 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:38] PROBLEM - puppet last run on mw2120 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:38] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:47] (03CR) 10Reedy: [C: 032] Log successful login attempts for a while [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321926 (https://phabricator.wikimedia.org/T150554) (owner: 10Reedy) [20:28:48] PROBLEM - puppet last run on conf1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:48] PROBLEM - puppet last run on ms-be2021 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:28:48] PROBLEM - puppet last run on mc1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:48] PROBLEM - puppet last run on elastic2024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:48] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:48] PROBLEM - puppet last run on mw2136 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:49] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:49] PROBLEM - puppet last run on mw1195 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:50] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:51] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:51] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:52] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:52] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:58] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:58] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:28:59] PROBLEM - puppet last run on es2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:59] PROBLEM - puppet last run on wtp2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:59] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:59] PROBLEM - puppet last run on mw2077 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:59] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:59] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:29:08] PROBLEM - puppet last run on es2018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:29:08] PROBLEM - puppet last run on mw2198 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:29:18] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:29:18] PROBLEM - puppet last run on db1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:29:18] PROBLEM - puppet last run on mw1249 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:29:18] (03Merged) 10jenkins-bot: Log successful login attempts for a while [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321926 (https://phabricator.wikimedia.org/T150554) (owner: 10Reedy) [20:29:28] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:29:28] PROBLEM - puppet last run on elastic1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:29:28] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:29:28] PROBLEM - puppet last run on wtp1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:29:28] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:29:29] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:29:38] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:29:38] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:29:38] PROBLEM - puppet last run on maps2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:29:47] temporarily [20:29:53] well, that didn't work [20:30:09] (03PS5) 10Andrew Bogott: Remove wikitech references from ldapconfig [puppet] - 10https://gerrit.wikimedia.org/r/309705 (owner: 10Alex Monk) [20:30:18] PROBLEM - puppet last run on elastic1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:30:18] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:30:18] PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:30:28] PROBLEM - puppet last run on analytics1026 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:30:38] PROBLEM - puppet last run on db1060 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:30:38] PROBLEM - puppet last run on mw1302 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:30:38] PROBLEM - puppet last run on db2068 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:30:38] PROBLEM - puppet last run on mw2177 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:30:38] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:30:38] (03CR) 10Hashar: "I wasn't aware of the openstack::repo class which even can get the version from hiera. Looks like a good way to stay in sync. Then it is" [puppet] - 10https://gerrit.wikimedia.org/r/306220 (https://phabricator.wikimedia.org/T137217) (owner: 10Hashar) [20:30:38] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:30:43] godog, yes it did [20:30:48] PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:30:48] PROBLEM - puppet last run on db2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:30:58] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lldp] [20:30:58] PROBLEM - puppet last run on mc1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:30:58] PROBLEM - puppet last run on mw1168 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:30:58] PROBLEM - puppet last run on wtp1014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lldp] [20:30:59] PROBLEM - puppet last run on db2062 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:30:59] PROBLEM - puppet last run on mw2088 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:30:59] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:30:59] PROBLEM - puppet last run on planet2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:31:05] Krenair: I still saw the messages from icinga-wm [20:31:12] cause of lag I guess [20:31:15] godog, yeah but you are opped [20:31:29] ah, ok I've put it back [20:31:39] oh ops see everything for moderation purposes I guess [20:31:40] neat [20:31:46] thanks Krenair [20:31:53] sorry about that [20:31:53] thank you [20:32:45] yuvipanda: np, the puppet compiler I think would have caught that tho [20:33:04] godog: yeah, i ran it on an older version [20:33:10] and it was fine [20:33:16] then made 'minor change' [20:33:21] not so minor [20:33:27] doh [20:33:48] IIRC there was also a proposal on running pcc automatically [20:33:49] !log reedy@tin Synchronized wmf-config/CommonSettings.php: consistency after pulling to gerrit (duration: 00m 49s) [20:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:09] (03Abandoned) 10Reedy: Log all failed login attempts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321114 (owner: 10Reedy) [20:36:34] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10hardware-requests: codfw: 1 hardware access request for continuous integration - 
https://phabricator.wikimedia.org/T150865#2800470 (10RobH) a:03mark contint1001 has Dual Intel® Xeon® Processor E5-2640 v3 (2.6GHz/8c), dua... [20:36:51] Krenair: I still saw the messages from icinga-wm p858snake|L2: ah! thanks that's useful [20:44:40] !log demon@tin Started scap: llamas on the move! [20:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:08] (03PS5) 10Andrew Bogott: dns-floating-ip-updater: use python's ipaddress class to determine PTR FQDNs for IPs [puppet] - 10https://gerrit.wikimedia.org/r/309708 (owner: 10Alex Monk) [20:45:33] (03PS1) 10Yuvipanda: Revert "base: Move package list to hiera" [puppet] - 10https://gerrit.wikimedia.org/r/321934 [20:46:09] (03CR) 10jenkins-bot: [V: 04-1] Revert "base: Move package list to hiera" [puppet] - 10https://gerrit.wikimedia.org/r/321934 (owner: 10Yuvipanda) [20:46:21] (03CR) 10Andrew Bogott: [C: 032] dns-floating-ip-updater: use python's ipaddress class to determine PTR FQDNs for IPs [puppet] - 10https://gerrit.wikimedia.org/r/309708 (owner: 10Alex Monk) [20:47:17] (03PS2) 10Yuvipanda: Revert "base: Move package list to hiera" [puppet] - 10https://gerrit.wikimedia.org/r/321934 [20:49:02] (03CR) 10Yuvipanda: [C: 032 V: 032] Revert "base: Move package list to hiera" [puppet] - 10https://gerrit.wikimedia.org/r/321934 (owner: 10Yuvipanda) [20:49:29] (03PS1) 10Filippo Giunchedi: Provision restbase201[012], add restbase2010-a [puppet] - 10https://gerrit.wikimedia.org/r/321935 (https://phabricator.wikimedia.org/T150680) [20:50:31] !log demon@tin Finished scap: llamas on the move! 
(duration: 05m 51s) [20:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:20] (03PS1) 10Reedy: Log users elevated groups on login attempts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321938 [20:54:34] (03PS1) 10Thcipriani: group1 wikis to 1.29.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321939 [20:54:36] (03CR) 10Thcipriani: [C: 032] group1 wikis to 1.29.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321939 (owner: 10Thcipriani) [20:55:06] (03PS4) 10Dduvall: [WIP] contint: New role for Docker based CI slave [puppet] - 10https://gerrit.wikimedia.org/r/320942 (https://phabricator.wikimedia.org/T150502) [20:55:08] (03Merged) 10jenkins-bot: group1 wikis to 1.29.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321939 (owner: 10Thcipriani) [20:56:27] !log demon@tin Synchronized private/: (no message) (duration: 00m 50s) [20:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:49] (03PS5) 10Ottomata: Deploy EventStreams on scb and configure LVS service in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/320690 (https://phabricator.wikimedia.org/T143925) [20:58:44] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to 1.29.0-wmf.3 [20:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:53] (03PS1) 10Ottomata: Add eventstreams.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/321940 (https://phabricator.wikimedia.org/T143925) [21:00:05] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, Amir1, and yurik: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161116T2100). Please do the needful. 
[21:00:08] (03CR) 10Ottomata: Deploy EventStreams on scb and configure LVS service in eqiad (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/320690 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [21:00:40] (03PS5) 10Dduvall: [WIP] contint: New role for Docker based CI slave [puppet] - 10https://gerrit.wikimedia.org/r/320942 (https://phabricator.wikimedia.org/T150502) [21:02:35] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: Add config option in tools webservice debian package to write logs to /dev/null - https://phabricator.wikimedia.org/T149946#2800590 (10yuvipanda) I've reverted and built package and pushed new images. we need to: 1. Install package on all webgrid nodes... [21:06:30] (03PS1) 10Filippo Giunchedi: prometheus: switch 'ops' prometheus to varbit encoding [puppet] - 10https://gerrit.wikimedia.org/r/321941 [21:16:37] (03PS7) 10Andrew Bogott: base: Allow auto puppetmaster switching tuning [puppet] - 10https://gerrit.wikimedia.org/r/256890 (https://phabricator.wikimedia.org/T120159) (owner: 10Yuvipanda) [21:18:47] (03PS8) 10Andrew Bogott: base: Allow auto puppetmaster switching tuning [puppet] - 10https://gerrit.wikimedia.org/r/256890 (https://phabricator.wikimedia.org/T120159) (owner: 10Yuvipanda) [21:19:28] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:19:48] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:19:59] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [21:20:58] RECOVERY - puppet last run on mw2106 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [21:22:06] (03PS9) 10Andrew Bogott: base: Allow auto puppetmaster switching tuning [puppet] - 10https://gerrit.wikimedia.org/r/256890 (https://phabricator.wikimedia.org/T120159) (owner: 10Yuvipanda) [21:22:43] 06Operations, 10Beta-Cluster-Infrastructure, 10Thumbor: Thumbor keeps losing Swift auth on beta - https://phabricator.wikimedia.org/T150649#2800637 (10fgiunchedi) @Krenair where are you seeing that btw? The issue afaics is that swift on `deployment-ms-fe01` doesn't have the password for `mw:thumbor` in `/et... [21:24:48] (03CR) 10jenkins-bot: [V: 04-1] base: Allow auto puppetmaster switching tuning [puppet] - 10https://gerrit.wikimedia.org/r/256890 (https://phabricator.wikimedia.org/T120159) (owner: 10Yuvipanda) [21:29:41] (03PS10) 10Andrew Bogott: base: Allow auto puppetmaster switching tuning [puppet] - 10https://gerrit.wikimedia.org/r/256890 (https://phabricator.wikimedia.org/T120159) (owner: 10Yuvipanda) [21:30:34] 06Operations, 10Traffic, 05Prometheus-metrics-monitoring: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479#2800666 (10fgiunchedi) I've captured `varnishstat -j` over the course of 1/2 day on `cp4001` and it seems the uuid is the backend "identity"... [21:38:15] (03CR) 10Andrew Bogott: "Oh, my mistake, that won't work on Jessie at all." 
[puppet] - 10https://gerrit.wikimedia.org/r/306220 (https://phabricator.wikimedia.org/T137217) (owner: 10Hashar) [21:38:36] (03CR) 10Andrew Bogott: [C: 032] nodepool: bump nova client and openstack CLI [puppet] - 10https://gerrit.wikimedia.org/r/306220 (https://phabricator.wikimedia.org/T137217) (owner: 10Hashar) [21:38:40] (03PS2) 10Andrew Bogott: nodepool: bump nova client and openstack CLI [puppet] - 10https://gerrit.wikimedia.org/r/306220 (https://phabricator.wikimedia.org/T137217) (owner: 10Hashar) [21:39:26] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10hardware-requests: codfw: 1 hardware access request for continuous integration - https://phabricator.wikimedia.org/T150865#2800694 (10hashar) @RobH pointed out contint1001 does not use SSD and that might be an IO bottlenec... [21:42:26] (03PS11) 10Andrew Bogott: Explicitly set up /var/spool/gridengine on grid master [puppet] - 10https://gerrit.wikimedia.org/r/321584 [21:45:58] PROBLEM - puppet last run on relforge1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:47:28] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [21:47:48] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [21:47:53] (03CR) 1020after4: [C: 031] Standardize most of the docroots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321726 (owner: 10Chad) [21:51:57] 06Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#2800727 (10Dzahn) I looked closer at permissions and general settings of group and changed it so that most actions can be performed by all group members, only members can read topics (messages), but the public can post the... [21:55:28] PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:05:17] 06Operations, 10Continuous-Integration-Infrastructure, 07Nodepool, 13Patch-For-Review: Clean up apt:pin of python modules used for Nodepool - https://phabricator.wikimedia.org/T137217#2800743 (10hashar) Will probably want to cleanup apt.wm.o jessie-wikimedia/backports I will reach out to European ops to... [22:06:52] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:13:32] RECOVERY - puppet last run on relforge1002 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [22:15:02] PROBLEM - puppet last run on rdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:15:49] (03PS1) 10Brian Wolff: Ban 100 most common passwords from ordinary accounts. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321991 [22:17:08] (03PS1) 10Jforrester: Beta Features: Update whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321992 [22:17:10] (03PS1) 10Jforrester: Provide the visual editor wikitext mode Beta Feature to all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321993 [22:23:32] RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [22:23:49] (03CR) 10Krinkle: [C: 031] Standardize most of the docroots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321726 (owner: 10Chad) [22:23:56] (03CR) 10Krinkle: [C: 031] Docroots: Remove commons and usability docroots, they use wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321919 (owner: 10Chad) [22:23:58] (03CR) 10Reedy: [C: 031] Ban 100 most common passwords from ordinary accounts. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321991 (owner: 10Brian Wolff) [22:27:44] (03CR) 10BryanDavis: "> This all looks reasonable to me. 
If you've already tested it a lot" [puppet] - 10https://gerrit.wikimedia.org/r/309216 (https://phabricator.wikimedia.org/T144955) (owner: 10BryanDavis) [22:35:52] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [22:36:56] sorry, little bit late for the services window but I'm planning to deploy mobileapps in a minute unless anyone objects [22:37:38] was really my fault [22:38:32] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:39:30] 06Operations, 06Security-Team: icinga notification if elevated writing to badpass.log - https://phabricator.wikimedia.org/T150300#2781237 (10Tgr) Related: {T150903} [22:39:57] !log starting mobileapps deployment [22:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:45] !log mobileapps deployed 7b04c47 [22:43:02] RECOVERY - puppet last run on rdb1003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [22:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:10] jouncebot now [22:49:25] I could have sworn that's a cmd [22:50:08] jouncebot: now [22:51:28] jouncebot_ now [22:51:28] No deployments scheduled for the next 1 hour(s) and 8 minute(s) [22:51:41] jouncebot_ reload [22:51:50] jouncebot_ refresh [22:51:54] I refreshed my knowledge about deployments. [22:51:57] jouncebot_ now [22:51:57] No deployments scheduled for the next 1 hour(s) and 8 minute(s) [22:52:13] Zppix|Away ^^ [22:56:33] jouncebot_: next [22:56:33] In 1 hour(s) and 3 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161117T0000) [23:00:05] what's wrong with "jouncebot" [23:00:30] nothing? [23:00:45] it's missing code to recover its primary nick.
I'll kick it [23:00:47] oh, the nick difference [23:01:12] jou always works for me :) [23:01:31] bd808 what language is jouncebot written in? I may be able to do something similar to grrrit-wm's nick cmd [23:01:52] it's Python. the fix is easy, it just hasn't been done [23:02:06] bd808 ah repo link? [23:02:12] (03PS1) 10Rush: gridengine: /var/spool/gridengine as a symlink [puppet] - 10https://gerrit.wikimedia.org/r/322008 [23:02:21] I can do Python :P [23:02:24] Zppix: https://gerrit.wikimedia.org/r/#/admin/projects/wikimedia/bots/jouncebot [23:03:07] Zppix: it needs something like this -- https://github.com/bd808/tools-stashbot/blob/master/stashbot/bot.py#L95-L100 [23:03:10] ok I will work on it later :) [23:03:21] (03CR) 10jenkins-bot: [V: 04-1] gridengine: /var/spool/gridengine as a symlink [puppet] - 10https://gerrit.wikimedia.org/r/322008 (owner: 10Rush) [23:03:31] bd808 do you want it as a command, or should it just happen automatically? [23:03:59] just automatic. telling bots what to do kind of defeats the purpose ;) [23:05:18] bd808 ok, just didn't know if there was anything that was preferred considering it's a very heavily used bot :P [23:06:00] where do you guys want the code? I'm not familiar with the jouncebot repo [23:06:02] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 385 bytes in 0.003 second response time [23:06:18] lol jouncebot I think killed itself [23:06:32] .... that can't be good [23:07:32] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [23:07:50] I told the job grid to restart it and apparently it did something weird [23:08:33] Zppix: it will be obvious.
all of the IRC bot parts are in one file named "jouncebot.py" [23:09:27] (03PS2) 10Rush: gridengine: /var/spool/gridengine as a symlink [puppet] - 10https://gerrit.wikimedia.org/r/322008 [23:09:42] (03PS2) 10Mattflaschen: Add dewiktionary to RESTBase on Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/321817 (https://phabricator.wikimedia.org/T150764) [23:10:02] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.245 second response time [23:10:39] (03CR) 10jenkins-bot: [V: 04-1] gridengine: /var/spool/gridengine as a symlink [puppet] - 10https://gerrit.wikimedia.org/r/322008 (owner: 10Rush) [23:11:18] (03CR) 10Ppchelko: Add dewiktionary to RESTBase on Beta Cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/321817 (https://phabricator.wikimedia.org/T150764) (owner: 10Mattflaschen) [23:12:51] (03PS1) 10Mattflaschen: Add German Wiktionary in beta (2nd try) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322010 (https://phabricator.wikimedia.org/T150764) [23:13:24] (03PS3) 10Rush: gridengine: /var/spool/gridengine as a symlink [puppet] - 10https://gerrit.wikimedia.org/r/322008 [23:14:33] (03CR) 10jenkins-bot: [V: 04-1] gridengine: /var/spool/gridengine as a symlink [puppet] - 10https://gerrit.wikimedia.org/r/322008 (owner: 10Rush) [23:14:51] bd808, any reason stashbot is not on gerrit/phab? [23:15:35] (03PS4) 10Rush: gridengine: /var/spool/gridengine as a symlink [puppet] - 10https://gerrit.wikimedia.org/r/322008 [23:16:10] arseny92: it's mirrored on phab [23:16:39] arseny92: https://phabricator.wikimedia.org/diffusion/1962/ [23:16:57] but I should move it to gerrit now that it's more than a toy for me to play with [23:17:14] (03CR) 10Mattflaschen: [C: 032] "Instructions say to do this before addWiki.php. I'm about to run that, as soon as this is confirmed deployed."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/322010 (https://phabricator.wikimedia.org/T150764) (owner: 10Mattflaschen) [23:17:38] I haven't found anything like wikimedia/bots/stashbot when I tried the search box on gerrit [23:17:44] (03Merged) 10jenkins-bot: Add German Wiktionary in beta (2nd try) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322010 (https://phabricator.wikimedia.org/T150764) (owner: 10Mattflaschen) [23:18:40] it's not in gerrit. the origin is https://github.com/bd808/tools-stashbot [23:19:13] also, since it's on phab, you can do codesearch on it regardless of what others say [23:19:16] https://phabricator.wikimedia.org/diffusion/1962/browse/master/ [23:19:28] click show search [23:19:39] pattern [23:19:46] grep file content [23:20:07] 06Operations, 10Wikimedia-General-or-Unknown: Class 'Memcached' not found when running mwscript eval.php on mw1017, mw1099 - https://phabricator.wikimedia.org/T150912#2801021 (10Dereckson) [23:20:52] (03PS5) 10Rush: gridengine: /var/spool/gridengine as a symlink [puppet] - 10https://gerrit.wikimedia.org/r/322008 [23:23:41] (03PS6) 10Rush: gridengine: /var/spool/gridengine as a symlink [puppet] - 10https://gerrit.wikimedia.org/r/322008 [23:25:31] (03CR) 10Dzahn: [C: 032] "Yea, fwiw there are actually people who use our puppet manifests outside wmf." [puppet] - 10https://gerrit.wikimedia.org/r/321650 (https://phabricator.wikimedia.org/T150727) (owner: 10Hashar) [23:25:48] (03PS2) 10Dzahn: contint: remove Apache 2.2 compatibility config [puppet] - 10https://gerrit.wikimedia.org/r/321650 (https://phabricator.wikimedia.org/T150727) (owner: 10Hashar) [23:25:57] (03CR) 10Rush: [C: 032] gridengine: /var/spool/gridengine as a symlink [puppet] - 10https://gerrit.wikimedia.org/r/322008 (owner: 10Rush) [23:28:33] (03CR) 10Gergő Tisza: [C: 031] "Looks good, although maybe it would be more grep-friendly if there was something like "normal"/"elevated" instead of the list of groups."
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321938 (owner: 10Reedy) [23:30:09] (03CR) 10Rush: [C: 04-1] "first, thanks for doing this. second, sorry I sort of stepped on your toes I went the symlink route and after spending all day cleaning up" [puppet] - 10https://gerrit.wikimedia.org/r/321584 (owner: 10Andrew Bogott) [23:30:13] (03Abandoned) 10Rush: Explicitly set up /var/spool/gridengine on grid master [puppet] - 10https://gerrit.wikimedia.org/r/321584 (owner: 10Andrew Bogott) [23:32:36] (03PS2) 10Reedy: Log users elevated groups on login attempts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321938 [23:35:55] (03PS3) 10Dzahn: contint: remove Apache 2.2 compatibility config [puppet] - 10https://gerrit.wikimedia.org/r/321650 (https://phabricator.wikimedia.org/T150727) (owner: 10Hashar) [23:51:08] (03CR) 10Arseny1992: [C: 04-1] Log users elevated groups on login attempts (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321938 (owner: 10Reedy) [23:51:52] (03PS1) 10Dzahn: contint: move .htaccess content for doc/integration to puppet [puppet] - 10https://gerrit.wikimedia.org/r/322019 (https://phabricator.wikimedia.org/T150727) [23:52:34] 06Operations, 10Traffic, 05Prometheus-metrics-monitoring: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479#2787072 (10BBlack) Yeah the UUID in there is actually from the VCL. Every time we change VCL, it's recompiled and the output is given a UUID... 
[23:55:54] (03PS2) 10Dzahn: contint: move .htaccess content for doc/integration to puppet [puppet] - 10https://gerrit.wikimedia.org/r/322019 (https://phabricator.wikimedia.org/T150727) [23:56:50] (03PS5) 10Filippo Giunchedi: role: add prometheus 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/321814 (https://phabricator.wikimedia.org/T150486) [23:56:56] (03CR) 10Dzahn: "sounds good :) I made a patch to move the .htaccess files over to puppet at https://gerrit.wikimedia.org/r/#/c/322019/ and to delete them " [puppet] - 10https://gerrit.wikimedia.org/r/321651 (https://phabricator.wikimedia.org/T150727) (owner: 10Hashar) [23:57:10] (03CR) 10Dzahn: [C: 04-1] contint: allow .htaccess on doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/321651 (https://phabricator.wikimedia.org/T150727) (owner: 10Hashar) [23:59:19] (03PS3) 10Dzahn: contint: move .htaccess content for doc/integration to puppet [puppet] - 10https://gerrit.wikimedia.org/r/322019 (https://phabricator.wikimedia.org/T150727)
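Editor's note on the jouncebot nick discussion (23:00 to 23:08 above): the chat says the bot is "missing code to recover its primary nick" and that the fix should be automatic rather than a command. The following is only a rough sketch of that idea, not the jouncebot or stashbot code. The class name `NickRecovery`, its method names, and the trailing-underscore fallback are all assumptions made for illustration; it builds raw IRC `NICK` command strings and leaves timers and the actual connection to whatever IRC library the bot uses.

```python
"""Hypothetical sketch of automatic primary-nick recovery for an IRC bot."""


class NickRecovery:
    """Tracks the nick a bot wants and decides when to ask for it back.

    All names here are illustrative assumptions, not taken from jouncebot.
    """

    def __init__(self, primary_nick):
        self.primary_nick = primary_nick
        self.current_nick = primary_nick

    def on_nick_in_use(self):
        """Server rejected our nick: fall back to a trailing-underscore variant."""
        self.current_nick = self.current_nick + "_"
        return "NICK " + self.current_nick

    def on_nick_accepted(self, nick):
        """Server confirmed that we are now known as `nick`."""
        self.current_nick = nick

    def periodic_check(self):
        """Called from a timer: request the primary nick if we do not hold it."""
        if self.current_nick != self.primary_nick:
            return "NICK " + self.primary_nick
        return None

    def on_other_user_gone(self, nick):
        """Another user quit or renamed: reclaim the primary nick if it was freed."""
        if nick == self.primary_nick and self.current_nick != self.primary_nick:
            return "NICK " + self.primary_nick
        return None
```

In use, the bot would send whatever command string these methods return and call `periodic_check()` every minute or so; if the primary nick is still taken, the server simply rejects the request and the next check tries again.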