[01:44:19] PROBLEM - puppet last run on mw2238 is CRITICAL: CRITICAL: puppet fail [02:12:33] RECOVERY - puppet last run on mw2238 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:27:04] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.6) (duration: 10m 54s) [02:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:44:36] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.7) (duration: 08m 09s) [02:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:51:41] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Jun 27 02:51:41 UTC 2016 (duration 7m 5s) [02:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:34:08] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 29 probes of 393 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [03:40:18] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 393 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [04:48:44] (03PS1) 10KartikMistry: Deploy Compact Language Links as default (Stage 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296180 (https://phabricator.wikimedia.org/T136677) [05:31:07] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: puppet fail [05:48:43] (03PS2) 10KartikMistry: Deploy Compact Language Links as default (Stage 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296180 (https://phabricator.wikimedia.org/T136677) [05:56:22] (03PS3) 10KartikMistry: Deploy Compact Language Links as default (Stage 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296180 (https://phabricator.wikimedia.org/T136677) [05:56:56] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [05:58:05] (03PS4) 10KartikMistry: Deploy Compact Language Links as default (Stage 3) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/296180 (https://phabricator.wikimedia.org/T136677) [06:31:03] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: puppet fail [06:31:12] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:51] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:02] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:12] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:23] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:10] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:21] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:47:01] I am checking analytics1049 fyi :) [06:48:56] this one is not lucky, we replaced a disk not long ago (https://phabricator.wikimedia.org/T137273) [06:55:21] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [06:55:21] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:56:20] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:41] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:57:01] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:20] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:58:12] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:12] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 
failures [07:00:23] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: analytics1049.eqiad.wmnet disk failure - https://phabricator.wikimedia.org/T137273#2408707 (10elukey) @Cmjohnson the server is showing another disk failure, it seems that we are not lucky with this one: ``` [1600281.300136] EXT4-fs error... [07:15:16] !log puppet stopped on analytics1049 to remove it completely from the Hadoop cluster - broken disk [07:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:18:49] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: puppet fail [07:45:40] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:46:14] (03PS1) 10Muehlenhoff: Remove access credentials for haithams [puppet] - 10https://gerrit.wikimedia.org/r/296187 [07:59:08] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: puppet fail [08:12:40] (03PS1) 10Muehlenhoff: Update to Linux 4.4.14 [debs/linux44] - 10https://gerrit.wikimedia.org/r/296188 [08:12:49] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [08:19:19] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:21:48] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: puppet fail [08:23:42] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update to Linux 4.4.14 [debs/linux44] - 10https://gerrit.wikimedia.org/r/296188 (owner: 10Muehlenhoff) [08:24:00] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [08:30:35] (03PS3) 10Elukey: Add the -T VSL API timeout parameter plus the related formatter. [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/295652 [08:33:38] RECOVERY - MariaDB Slave Lag: m3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.38 seconds [08:35:24] 06Operations, 10hardware-requests: esams: (3?) 
SAS 2TB disks for ms-be* systems - https://phabricator.wikimedia.org/T138618#2408754 (10fgiunchedi) true @Peachey88 , moved to #hardware-requests [08:39:15] (03CR) 10Filippo Giunchedi: [C: 031] "two nits, LGTM otherwise" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/295975 (owner: 1020after4) [08:45:42] 06Operations, 10netops: Network ACL rules to allow traffic from Analytics to Production for port 9060 - https://phabricator.wikimedia.org/T138609#2405243 (10akosiaris) https://gerrit.wikimedia.org/r/#/c/295907/2 enabled 9160 (and 9160 is in the commit message). The task refers to 9060 (and 9061 in the long des... [08:47:45] 06Operations, 10netops: Network ACL rules to allow traffic from Analytics to Production for port 9160 - https://phabricator.wikimedia.org/T138609#2408764 (10elukey) [08:48:12] 06Operations, 10netops: Network ACL rules to allow traffic from Analytics to Production for port 9160 - https://phabricator.wikimedia.org/T138609#2405243 (10elukey) [08:48:50] 06Operations, 10netops: Network ACL rules to allow traffic from Analytics to Production for port 9160 - https://phabricator.wikimedia.org/T138609#2405243 (10elukey) Really sorry for the typos, brain not functioning correctly after a long day :) 9160 is the correct port thanks! [08:49:37] akosiaris: my brain was malfunctioning, ETOOMANYTYPOS --^ [08:50:52] (03CR) 10Alexandros Kosiaris: "why was that required in the first place ? Production is monitored in like 2 more places (per worker, LVS level). 
And with the swagger spe" [puppet] - 10https://gerrit.wikimedia.org/r/296054 (owner: 10Dzahn) [08:54:03] !log swift codfw-prod ms-be202[234] weight 2000 [08:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:59:17] (03PS2) 10Gehel: Add new elasticsearch servers to LVS [puppet] - 10https://gerrit.wikimedia.org/r/295657 (https://phabricator.wikimedia.org/T138329) [09:00:31] !log adding new elasticsearch servers in eqiad to LVS [09:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:02:39] (03CR) 10Gehel: [C: 032] Add new elasticsearch servers to LVS [puppet] - 10https://gerrit.wikimedia.org/r/295657 (https://phabricator.wikimedia.org/T138329) (owner: 10Gehel) [09:06:27] as usual forgot to set it earlier [09:06:58] !log gehel@palladium conftool action : set/pooled=yes; selector: elastic1032.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=elasticsearch', 'service=elasticsearch']) [09:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:07:08] !log gehel@palladium conftool action : set/pooled=yes; selector: elastic1032.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=elasticsearch', 'service=elasticsearch-ssl']) [09:08:52] (03CR) 10Filippo Giunchedi: [C: 031] Move graphite ferm rules out of role:graphite::base [puppet] - 10https://gerrit.wikimedia.org/r/295919 (owner: 10Muehlenhoff) [09:10:13] !log gehel@palladium conftool action : get/pooled; selector: elastic10??\.eqiad\.wmnet (tags: ['dc=eqiad', 'cluster=elasticsearch', 'service=elasticsearch']) [09:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:15:41] (03PS3) 10Muehlenhoff: Move graphite ferm rules out of role:graphite::base [puppet] - 10https://gerrit.wikimedia.org/r/295919 [09:15:59] (03CR) 10Muehlenhoff: [C: 032 V: 032] Move graphite ferm rules out of role:graphite::base [puppet] - 10https://gerrit.wikimedia.org/r/295919 (owner: 10Muehlenhoff) [09:18:14] 
!log gehel@palladium conftool action : set/pooled=yes; selector: dc=eqiad,cluster=elasticsearch,service=elasticsearch,name=elastic103..eqiad.wmnet [09:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:18:37] !log gehel@palladium conftool action : set/pooled=yes; selector: dc=eqiad,cluster=elasticsearch,service=elasticsearch-ssl,name=elastic103..eqiad.wmnet [09:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:19:40] !log gehel@palladium conftool action : set/pooled=yes; selector: dc=eqiad,cluster=elasticsearch,service=elasticsearch,name=elastic104..eqiad.wmnet [09:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:19:55] !log gehel@palladium conftool action : set/pooled=yes; selector: dc=eqiad,cluster=elasticsearch,service=elasticsearch-ssl,name=elastic104..eqiad.wmnet [09:24:47] 06Operations, 10Phabricator: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#2408825 (10Bawolff) I'm removing the security tag on this bug. My personal opinion is this should be declined. But either way its not a security issue, so I'm removing the security tag. [09:32:30] quiz time: does puppet create resources in parallel and/or is it able to? [09:33:10] godog: wild guess, but I'd answer : no... [09:33:54] gehel: heh I suspected the same but my googling isn't able to confirm/deny [09:35:52] I'd assume 'no' too for now [09:37:37] I'm pretty sure the catalogue is serialized / linearized before execution, so we lose the graph of dependencies at that point (not sure what the correct term is) [09:38:45] execution order is at least deterministic (since 2.7 I think). It is probably possible to have concurrent and deterministic order, but hard enough to implement that I doubt very much it is done. 
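(editor's aside on the ordering question above) A minimal sketch, assuming nothing about Puppet's internals, of what "linearized before execution" with a deterministic order can look like: a topological sort that breaks ties by resource name, so the same dependency graph always yields the same serial order. The `linearize` helper and the resource names are hypothetical and used only for illustration; this is not Puppet's actual implementation.

```python
import heapq


def linearize(deps):
    """Flatten a dependency graph into one deterministic serial order.

    deps maps a resource name to the resources it requires.
    Hypothetical helper for illustration only, not Puppet code.
    """
    # Normalise requirements to sets so duplicates do not skew the counts.
    graph = {node: set(reqs) for node, reqs in deps.items()}

    # Include resources that only ever appear as a requirement.
    nodes = set(graph)
    for reqs in graph.values():
        nodes.update(reqs)

    # pending[n] = number of requirements of n that have not been applied yet.
    pending = {n: len(graph.get(n, ())) for n in nodes}
    dependents = {n: [] for n in nodes}
    for node, reqs in graph.items():
        for req in reqs:
            dependents[req].append(node)

    # A min-heap of ready resources gives a stable, name-based tie-break.
    ready = [n for n, count in pending.items() if count == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        node = heapq.heappop(ready)
        order.append(node)
        for dependent in dependents[node]:
            pending[dependent] -= 1
            if pending[dependent] == 0:
                heapq.heappush(ready, dependent)

    if len(order) != len(nodes):
        raise ValueError("dependency cycle detected")
    return order


example = {
    "Service[nginx]": {"Package[nginx]", "File[/etc/nginx/nginx.conf]"},
    "File[/etc/nginx/nginx.conf]": {"Package[nginx]"},
    "Package[nginx]": set(),
}
print(linearize(example))
```

Once the graph is flattened like this, the information about which resources are independent of each other is no longer explicit, which is why running them concurrently would need the original graph rather than the serial list.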
[09:39:59] yup, I suppose the general problem has to do with dealing with 'slow' resources [09:40:08] see also comments on https://gerrit.wikimedia.org/r/#/c/295679/ [09:40:49] not a huge problem now heh, never thought about it before [09:42:59] 06Operations, 10Wikimedia-SVG-rendering, 13Patch-For-Review: Install Amiri font (arabic) for svg - https://phabricator.wikimedia.org/T135347#2408851 (10MoritzMuehlenhoff) 05Open>03Resolved Fixed. [09:44:23] (03PS4) 10Gehel: Remove old maps-test servers from LVS config [puppet] - 10https://gerrit.wikimedia.org/r/295640 [09:45:41] (03CR) 10Gehel: [C: 032] Remove old maps-test servers from LVS config [puppet] - 10https://gerrit.wikimedia.org/r/295640 (owner: 10Gehel) [09:47:42] !log removing maps-test*.codfw.wmnet servers from LVS (T138092) [09:47:43] T138092: Configure new maps servers in eqiad - https://phabricator.wikimedia.org/T138092 [09:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:48:36] !log stopping and reimporting db2010 (m1) [09:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:50:37] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/295437 (https://phabricator.wikimedia.org/T129144) (owner: 10Smalyshev) [09:51:12] (03CR) 10Gehel: [C: 04-1] "LGTM, but needs discussion in Ops weekly before merging." [puppet] - 10https://gerrit.wikimedia.org/r/295968 (https://phabricator.wikimedia.org/T138627) (owner: 10Smalyshev) [09:58:01] PROBLEM - HP RAID on ms-be2022 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [10:00:20] RECOVERY - HP RAID on ms-be2022 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [10:02:14] srsly :( [10:05:10] load? [10:05:35] or is it the check timeout? 
[10:09:20] I think it might be both, but 2023 / 2024 should be loaded the same and didn't time out [10:09:58] !log pooled mw1291 (jessie imagescaler) [10:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:11:27] db >= 1074 had only a few fewer drives and never timeout'd (although they have probably fewer iops) OK: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, Controller, Battery/Capacitor [10:11:48] how many virtual drives ? [10:11:55] logical drives that is [10:12:14] logical? as in devices exposed to the os? [10:12:37] only 1 [10:13:45] moritzm: Hi! Are we ok to install debian on the remaining imagescalers? https://etherpad.wikimedia.org/p/jessie-install - mw129[3-8] [10:16:26] jynus: yeah exposed to the os, for ms-be we expose all disks in raid0 as logical drives [10:16:40] ah, so way more operations [10:17:46] yeah at least 2x per drive [10:17:54] hpssacli invocation that is [10:18:37] an option would be to cache the calls, done with a cron, and just check the output? [10:20:05] elukey: yeah, currently writing a mail to wikitech to give a headsup [10:23:09] jynus: yeah that's an option too, and/or check less frequently anyway, not sure how often it is now, possibly icinga's default [10:28:45] (03PS4) 10Dzahn: Install arcanist from apt rather than git. [puppet] - 10https://gerrit.wikimedia.org/r/295975 (owner: 1020after4) [10:31:56] (03CR) 10Dzahn: [C: 032] Install arcanist from apt rather than git. 
[puppet] - 10https://gerrit.wikimedia.org/r/295975 (owner: 1020after4) [10:33:07] (03CR) 10Dzahn: "achievement unlocked, mile high club, merge something while at 10000 feet on a plane" [puppet] - 10https://gerrit.wikimedia.org/r/295975 (owner: 1020after4) [10:37:52] 06Operations, 10netops: Network ACL rules to allow traffic from Analytics to Production for port 9160 - https://phabricator.wikimedia.org/T138609#2408949 (10akosiaris) 05Open>03Resolved a:03akosiaris Port 9160 is now allowed. Resolving, feel free to reopen [10:38:12] (03PS1) 10Muehlenhoff: graphite/production: Limit to PRODUCTION_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/296202 [10:42:57] (03CR) 10Dzahn: [C: 031] graphite/production: Limit to PRODUCTION_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/296202 (owner: 10Muehlenhoff) [10:44:51] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2408963 (10KartikMistry) [10:49:43] (03PS1) 10KartikMistry: apertium-hbs-slv: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-hbs-slv] - 10https://gerrit.wikimedia.org/r/296203 (https://phabricator.wikimedia.org/T107306) [10:51:16] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2408970 (10KartikMistry) [10:54:52] (03PS11) 10Filippo Giunchedi: prometheus: add nginx reverse proxy [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) [10:55:42] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "merging, Ori we can discuss more on the ensure=>absent thing and change it in case" [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [10:56:42] 06Operations, 
10Ops-Access-Requests: Requesting access to deployment hosts (tin/terbium) for Brian Wolff - https://phabricator.wikimedia.org/T138635#2408989 (10Dzahn) @Bawolff Please read and sign L3 and attach a (new) SSH public key to this ticket. Thank you! [10:59:02] 06Operations, 10Jupyter-Hub: notebook1001 shown as DOWN in icinga, due to firewall rules - https://phabricator.wikimedia.org/T138685#2408990 (10Dzahn) Briefly talked to Yuvi and Madhu. These rules come from the docker package. [11:04:04] 06Operations, 10DBA, 13Patch-For-Review: Upgrade m1 db servers - https://phabricator.wikimedia.org/T135973#2408998 (10jcrespo) All m1 servers are now on jessie/MariaDB 10.0.23. [11:08:21] (03PS1) 10Filippo Giunchedi: site: add prometheus[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/296205 (https://phabricator.wikimedia.org/T126785) [11:09:22] (03PS1) 10Dzahn: introduce zosma.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/296206 (https://phabricator.wikimedia.org/T138650) [11:12:53] (03CR) 10Dzahn: [C: 032] introduce zosma.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/296206 (https://phabricator.wikimedia.org/T138650) (owner: 10Dzahn) [11:17:22] 06Operations, 10DBA, 13Patch-For-Review: reimage or decom db servers on precise - https://phabricator.wikimedia.org/T125028#2409010 (10jcrespo) [11:17:24] 06Operations, 10DBA, 13Patch-For-Review: Upgrade m1 db servers - https://phabricator.wikimedia.org/T135973#2409009 (10jcrespo) 05Open>03Resolved [11:17:53] (03PS1) 10KartikMistry: apertium-oc-ca: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-oc-ca] - 10https://gerrit.wikimedia.org/r/296207 (https://phabricator.wikimedia.org/T107306) [11:18:13] 06Operations, 10DBA, 10Phabricator: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2409012 (10jcrespo) Thank you. [11:18:39] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 10Wikimedia-SVG-rendering, 07I18n: MB Lateefi Fonts for Sindhi Wikipedia. 
- https://phabricator.wikimedia.org/T138136#2409015 (10MoritzMuehlenhoff) While for being packaged in Debian/Ubuntu the TTF file is sufficient, ideally it should also pr... [11:21:43] (03PS1) 10KartikMistry: apertium-oc-es: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-oc-es] - 10https://gerrit.wikimedia.org/r/296209 (https://phabricator.wikimedia.org/T107306) [11:22:03] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2409032 (10jcrespo) p:05High>03Normal [11:22:57] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2409033 (10KartikMistry) [11:30:37] (03PS1) 10Dzahn: introduce zosma.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/296211 (https://phabricator.wikimedia.org/T138650) [11:31:44] (03CR) 10jenkins-bot: [V: 04-1] introduce zosma.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/296211 (https://phabricator.wikimedia.org/T138650) (owner: 10Dzahn) [11:34:04] 06Operations, 10vm-requests, 13Patch-For-Review, 05Security: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650#2409039 (10Dzahn) [11:40:41] (03PS2) 10Dzahn: introduce zosma.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/296211 (https://phabricator.wikimedia.org/T138650) [11:40:43] PROBLEM - HP RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [11:41:22] akosiaris: re: 134496 VM for url-downloader, should it be a misc server with an element name? 
[11:41:35] sigh :( jynus ^ timed out ms-be2023 too [11:41:46] have to go to lunch [11:42:11] (03PS1) 10KartikMistry: apertium-mk-bg: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-mk-bg] - 10https://gerrit.wikimedia.org/r/296212 (https://phabricator.wikimedia.org/T107306) [11:43:02] RECOVERY - HP RAID on ms-be2023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [11:43:16] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2409046 (10KartikMistry) [11:44:11] mutante: yes, but it's blocked on T134242 which is why I haven't moved forward with it yet [11:44:11] T134242: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242 [11:44:39] (03CR) 10Dzahn: [C: 032] introduce zosma.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/296211 (https://phabricator.wikimedia.org/T138650) (owner: 10Dzahn) [11:45:37] akosiaris: hmm, i see, ok. well, i just was going to create another new VM [11:45:40] that said, T134242 is bound to be resolved this week [11:45:41] T134242: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242 [11:45:43] on the codfw cluster [11:45:48] aha! cool! [11:46:20] mutante: T134496 is for EQIAD though [11:46:21] T134496: EQIAD: (1) VM request for url-downloader - https://phabricator.wikimedia.org/T134496 [11:47:10] akosiaris: eh, right, yea, element names. 
ok [11:49:00] (03PS1) 10KartikMistry: apertium-is-sv: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-is-sv] - 10https://gerrit.wikimedia.org/r/296213 (https://phabricator.wikimedia.org/T107306) [11:49:24] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2409052 (10KartikMistry) [11:50:20] (03PS2) 10Muehlenhoff: saltmaster/production: Use PRODUCTION_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/295782 [11:51:43] (03PS9) 10Alexandros Kosiaris: networks::constants: use slice_network_constants [puppet] - 10https://gerrit.wikimedia.org/r/291819 [11:53:24] (03CR) 10Alexandros Kosiaris: [C: 031] saltmaster/production: Use PRODUCTION_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/295782 (owner: 10Muehlenhoff) [11:54:31] (03CR) 10Dzahn: [C: 031] saltmaster/production: Use PRODUCTION_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/295782 (owner: 10Muehlenhoff) [11:55:10] (03CR) 10ArielGlenn: [C: 032] add job that dumps history of flow pages [dumps] - 10https://gerrit.wikimedia.org/r/295587 (https://phabricator.wikimedia.org/T89398) (owner: 10ArielGlenn) [11:55:30] (03CR) 10Muehlenhoff: [C: 032 V: 032] saltmaster/production: Use PRODUCTION_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/295782 (owner: 10Muehlenhoff) [11:55:56] (03PS1) 10KartikMistry: apertium-mlt-ara: Rebuild for Jessie and new upstream [debs/contenttranslation/apertium-mlt-ara] - 10https://gerrit.wikimedia.org/r/296214 (https://phabricator.wikimedia.org/T107306) [11:56:25] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2409070 (10KartikMistry) [11:58:03] (03CR) 10Alexandros Kosiaris: [C: 032] "Finally the diff at 
https://puppet-compiler.wmflabs.org/3203/carbon.wikimedia.org/ is good enough. It contains any network that makes sens" [puppet] - 10https://gerrit.wikimedia.org/r/291819 (owner: 10Alexandros Kosiaris) [11:58:19] (03CR) 10Alexandros Kosiaris: [C: 032] networks::constants: use slice_network_constants [puppet] - 10https://gerrit.wikimedia.org/r/291819 (owner: 10Alexandros Kosiaris) [11:58:26] (03PS10) 10Alexandros Kosiaris: networks::constants: use slice_network_constants [puppet] - 10https://gerrit.wikimedia.org/r/291819 [11:58:31] (03CR) 10Alexandros Kosiaris: [V: 032] networks::constants: use slice_network_constants [puppet] - 10https://gerrit.wikimedia.org/r/291819 (owner: 10Alexandros Kosiaris) [11:58:44] (03CR) 10ArielGlenn: "I just realized something at the last minute. Is it possible that private wikis have url shorteners too?" [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: 10ArielGlenn) [12:00:36] 06Operations, 10vm-requests, 13Patch-For-Review, 05Security: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650#2409075 (10Dzahn) [12:00:40] (03CR) 10Alexandros Kosiaris: [C: 031] site: add prometheus[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/296205 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [12:04:02] (03CR) 10Dzahn: "05:07 < akosiaris> we do have a url-downloader in labs" [puppet] - 10https://gerrit.wikimedia.org/r/295781 (owner: 10Muehlenhoff) [12:04:32] (03CR) 10Alexandros Kosiaris: [C: 04-1] url_downloader: Use PRODUCTION_NETWORKS in ferm rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295781 (owner: 10Muehlenhoff) [12:14:07] (03PS1) 10Muehlenhoff: install_server::tftp_server: Use PRODUCTION_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/296218 [12:15:06] (03CR) 10Alexandros Kosiaris: [C: 031] install_server::tftp_server: Use PRODUCTION_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/296218 (owner: 
10Muehlenhoff) [12:42:32] (03CR) 10Alexandros Kosiaris: "Just for posterity, the sphere private was just to keep backwards compatibility and discussion was moved into https://etherpad.wikimedia.o" [puppet] - 10https://gerrit.wikimedia.org/r/295333 (owner: 10Alexandros Kosiaris) [12:42:38] (03Abandoned) 10Alexandros Kosiaris: ferm: Populate INTERNAL from network::constants [puppet] - 10https://gerrit.wikimedia.org/r/295333 (owner: 10Alexandros Kosiaris) [12:45:06] 06Operations, 07Puppet, 13Patch-For-Review: Reconsider the aligning arrows puppet lint - https://phabricator.wikimedia.org/T137763#2378054 (10ArielGlenn) It's easier on my eyes; that's one of the things style checks do is make it easier for the eyes to pass over code so the reader can concentrate on the subs... [12:46:33] PROBLEM - Apache HTTP on mw1198 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50405 bytes in 0.005 second response time [12:48:53] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.053 second response time [13:21:59] (03PS1) 10KartikMistry: apertium-hin: New upstream release and rebuild for Jessie [debs/contenttranslation/apertium-hin] - 10https://gerrit.wikimedia.org/r/296228 (https://phabricator.wikimedia.org/T107306) [13:22:48] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2409163 (10KartikMistry) [13:27:21] (03PS1) 10KartikMistry: apertium-urd: New upstream release and rebuild for Jessie [debs/contenttranslation/apertium-urd] - 10https://gerrit.wikimedia.org/r/296229 (https://phabricator.wikimedia.org/T107306) [13:27:46] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2409165 
(10KartikMistry) [13:35:12] 06Operations, 10Monitoring: diamond: certain counters always calculated as 0 - https://phabricator.wikimedia.org/T138758#2409174 (10ema) [13:36:46] 06Operations, 10ops-eqiad: eqiad: Install SSD's into ganeti hosts - https://phabricator.wikimedia.org/T138414#2409187 (10akosiaris) p:05Triage>03Low So, `ganeti1004` is ready. There are no VMs on it, everything has been migrated to the other 3 VMs, so feel free to swap the disks with SSDs! [13:40:13] PROBLEM - HP RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [13:42:23] RECOVERY - HP RAID on ms-be2023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:44:53] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [13:47:04] ^Could not find any host matching 'mw1293' [13:50:21] this is me! [13:50:27] same thing that happened last week [13:50:28] new nodes [13:50:41] I am also installing mw1294 [13:50:44] jynus: --^ [13:51:17] it's a race w/ icinga I imagine [13:51:57] yeah probably [13:52:06] I can't schedule downtime yet on neaon [13:52:08] *neon [13:53:55] it is ok, then [14:01:23] (03PS1) 10KartikMistry: hfst-ospell: Initial Debian packaging [debs/contenttranslation/hfst-ospell] - 10https://gerrit.wikimedia.org/r/296231 (https://phabricator.wikimedia.org/T107306) [14:01:55] Can somebody fix T138200 very soon? This is really annoying cswiki users. See the discussion https://cs.wikipedia.org/wiki/Wikipedie:Pod_l%C3%ADpou_(technika)#P.C5.99ehazuje_visual_editor_.C5.99.C3.A1dky_v_iboxech.3F.21 . Thanks for fixing. 
[14:01:55] T138200: TemplateData: Do not initialize paramOrder to paramNames if paramOrder has not been provided in the markup - https://phabricator.wikimedia.org/T138200 [14:04:51] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2409225 (10KartikMistry) [14:05:58] PROBLEM - Apache HTTP on mw1294 is CRITICAL: Connection timed out [14:05:59] PROBLEM - Apache HTTP on mw1293 is CRITICAL: Connection timed out [14:06:40] PROBLEM - nutcracker port on mw1293 is CRITICAL: Timeout while attempting connection [14:06:40] PROBLEM - nutcracker port on mw1294 is CRITICAL: Timeout while attempting connection [14:07:08] PROBLEM - nutcracker process on mw1293 is CRITICAL: Timeout while attempting connection [14:07:08] PROBLEM - nutcracker process on mw1294 is CRITICAL: Timeout while attempting connection [14:07:28] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [14:07:38] PROBLEM - puppet last run on mw1293 is CRITICAL: Timeout while attempting connection [14:07:38] PROBLEM - puppet last run on mw1294 is CRITICAL: Timeout while attempting connection [14:07:58] PROBLEM - salt-minion processes on mw1294 is CRITICAL: Timeout while attempting connection [14:07:59] PROBLEM - salt-minion processes on mw1293 is CRITICAL: Timeout while attempting connection [14:08:38] PROBLEM - Check size of conntrack table on mw1293 is CRITICAL: Timeout while attempting connection [14:08:38] PROBLEM - Check size of conntrack table on mw1294 is CRITICAL: Timeout while attempting connection [14:08:59] PROBLEM - DPKG on mw1294 is CRITICAL: Timeout while attempting connection [14:08:59] PROBLEM - DPKG on mw1293 is CRITICAL: Timeout while attempting connection [14:09:19] PROBLEM - Disk space on mw1294 is CRITICAL: Timeout while attempting connection [14:09:19] PROBLEM - Disk space on 
mw1293 is CRITICAL: Timeout while attempting connection [14:09:38] PROBLEM - MD RAID on mw1293 is CRITICAL: Timeout while attempting connection [14:09:38] PROBLEM - MD RAID on mw1294 is CRITICAL: Timeout while attempting connection [14:09:43] ouch [14:10:29] PROBLEM - configured eth on mw1294 is CRITICAL: Timeout while attempting connection [14:10:29] PROBLEM - configured eth on mw1293 is CRITICAL: Timeout while attempting connection [14:10:50] PROBLEM - dhclient process on mw1294 is CRITICAL: Timeout while attempting connection [14:10:50] PROBLEM - dhclient process on mw1293 is CRITICAL: Timeout while attempting connection [14:10:59] PROBLEM - mediawiki-installation DSH group on mw1294 is CRITICAL: Host mw1294 is not in mediawiki-installation dsh group [14:10:59] PROBLEM - mediawiki-installation DSH group on mw1293 is CRITICAL: Host mw1293 is not in mediawiki-installation dsh group [14:20:03] ah snap my bad, silencing [14:20:08] I was writing a wiki page [14:20:26] Luke081515: sorry new servers, nothing bad happening [14:20:57] that's good :) [14:26:50] RECOVERY - Apache HTTP on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.002 second response time [14:29:29] RECOVERY - dhclient process on mw1294 is OK: PROCS OK: 0 processes with command name dhclient [14:29:38] RECOVERY - Check size of conntrack table on mw1294 is OK: OK: nf_conntrack is 0 % full [14:30:09] RECOVERY - nutcracker port on mw1294 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [14:30:29] RECOVERY - Disk space on mw1294 is OK: DISK OK [14:30:39] RECOVERY - nutcracker process on mw1294 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [14:30:48] RECOVERY - MD RAID on mw1294 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [14:31:07] 06Operations, 06Discovery, 10Elasticsearch, 03Discovery-Search-Sprint, 13Patch-For-Review: Install and configure new elasticsearch servers in eqiad - https://phabricator.wikimedia.org/T138329#2409246 
(10Gehel) [14:31:10] PROBLEM - puppet last run on aqs1002 is CRITICAL: CRITICAL: Puppet has 1 failures [14:31:29] RECOVERY - salt-minion processes on mw1294 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:31:30] 06Operations, 10ops-eqiad: rack/setup/install/deploy labsdb1009-labsdb1011 - https://phabricator.wikimedia.org/T136860#2409260 (10Cmjohnson) [14:31:49] RECOVERY - configured eth on mw1294 is OK: OK - interfaces up [14:32:52] (03CR) 10EBernhardson: [C: 031] "been running in beta cluster for a few days, everything looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/295578 (https://phabricator.wikimedia.org/T138335) (owner: 10EBernhardson) [14:33:06] 06Operations, 10ops-eqiad: eqiad: Install ssds to labmon1001 - https://phabricator.wikimedia.org/T138415#2409261 (10Cmjohnson) 05Open>03Resolved The ssds were added to labmon1001 and @robh copied data back from the external disk. [14:34:51] !log removing old elasticsearch servers in eqiad from LVS (elastic1001-1016 - T138329) [14:34:52] T138329: Install and configure new elasticsearch servers in eqiad - https://phabricator.wikimedia.org/T138329 [14:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:34:59] RECOVERY - DPKG on mw1294 is OK: All packages OK [14:37:32] !log gehel@palladium conftool action : get/pooled; selector: dc=eqiad,cluster=elasticsearch,service=elasticsearch,name=elastic100[0-9]..eqiad.wmnet [14:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:38:59] RECOVERY - Apache HTTP on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.014 second response time [14:39:40] !log gehel@palladium conftool action : set/pooled=no; selector: dc=eqiad,cluster=elasticsearch,service=elasticsearch,name=elastic100[0-9].eqiad.wmnet [14:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:42:00] RECOVERY - nutcracker port on mw1293 is OK: 
TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [14:42:19] RECOVERY - Disk space on mw1293 is OK: DISK OK [14:42:28] RECOVERY - nutcracker process on mw1293 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [14:42:38] RECOVERY - MD RAID on mw1293 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [14:43:19] RECOVERY - salt-minion processes on mw1293 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:43:35] --^ new imagescaler getting up to speed [14:43:38] RECOVERY - configured eth on mw1293 is OK: OK - interfaces up [14:43:49] RECOVERY - dhclient process on mw1293 is OK: PROCS OK: 0 processes with command name dhclient [14:43:58] RECOVERY - Check size of conntrack table on mw1293 is OK: OK: nf_conntrack is 0 % full [14:45:21] (03PS11) 10EBernhardson: logstash: Update logstash for sending to es 2.x [puppet] - 10https://gerrit.wikimedia.org/r/295578 (https://phabricator.wikimedia.org/T138335) [14:46:39] RECOVERY - DPKG on mw1293 is OK: All packages OK [14:48:53] (03PS1) 10Muehlenhoff: Install fonts-noto-cjk on jessie image scalers [puppet] - 10https://gerrit.wikimedia.org/r/296237 (https://phabricator.wikimedia.org/T123223) [14:49:55] !log gehel@palladium conftool action : set/pooled=no; selector: dc=eqiad,cluster=elasticsearch,service=elasticsearch,name=elastic101[0-6].eqiad.wmnet [14:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:53:30] 06Operations, 06Discovery, 10Elasticsearch, 03Discovery-Search-Sprint, 13Patch-For-Review: Install and configure new elasticsearch servers in eqiad - https://phabricator.wikimedia.org/T138329#2409311 (10Gehel) [14:53:58] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [14:54:00] PROBLEM - Apache HTTP on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:55:29] PROBLEM - HHVM rendering on mw1136 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:55:41] ^^ checking CirrusSearch codfw [14:56:19] PROBLEM - SSH on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:56:29] PROBLEM - dhclient process on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:56:29] PROBLEM - configured eth on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:56:43] this one is not mine [14:56:49] PROBLEM - nutcracker process on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:56:59] PROBLEM - Check size of conntrack table on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:57:00] PROBLEM - nutcracker port on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:57:10] PROBLEM - puppet last run on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:57:19] PROBLEM - DPKG on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:57:29] PROBLEM - salt-minion processes on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:57:30] PROBLEM - Disk space on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:57:58] PROBLEM - HHVM processes on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:00] going to powercycle it [14:59:18] RECOVERY - puppet last run on aqs1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:00:00] !log mw1136 powercycled - not responsive to ssh and root login [15:00:04] anomie, ostriches, thcipriani, and marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160627T1500). [15:00:04] Urbanecm and kart_: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[15:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:00:17] 06Operations, 10Phabricator: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#2409322 (10greg) 05Open>03declined Declining per Brandon's and Brian's last comments: >>! In T104730#1641121, @BBlack wrote: > Yes, basically once you separate out the T104735 concern, the... [15:00:45] (03CR) 10Thcipriani: "Looks like there is another rule that could be removed with this cleanup" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295744 (owner: 10Urbanecm) [15:01:11] I can SWAT today. [15:01:15] * kart_ here [15:01:39] kart_: kk, you're up :) [15:01:52] thcipriani: test deploy as usual :) [15:01:57] ack [15:01:59] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 41 minutes ago with 0 failures [15:01:59] RECOVERY - DPKG on mw1136 is OK: All packages OK [15:02:10] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296180 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [15:02:11] RECOVERY - Disk space on mw1136 is OK: DISK OK [15:02:11] RECOVERY - salt-minion processes on mw1136 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:02:26] 06Operations, 10ops-eqiad: rack/setup/install/deploy labvirt nodes - https://phabricator.wikimedia.org/T138509#2409328 (10Cmjohnson) [15:02:28] 06Operations, 10ops-eqiad: rack/setup/install/deploy labvirt nodes - https://phabricator.wikimedia.org/T138509#2409330 (10Cmjohnson) [15:02:46] (03Merged) 10jenkins-bot: Deploy Compact Language Links as default (Stage 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296180 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [15:02:48] RECOVERY - nutcracker port on mw1136 is OK: TCP OK - 0.000 second response time on port 11212 [15:03:08] RECOVERY - SSH on mw1136 is OK: SSH OK - 
OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [15:03:19] RECOVERY - configured eth on mw1136 is OK: OK - interfaces up [15:03:38] RECOVERY - dhclient process on mw1136 is OK: PROCS OK: 0 processes with command name dhclient [15:03:49] RECOVERY - nutcracker process on mw1136 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [15:04:03] kart_: patch pulled to mw1017 [15:05:03] thcipriani: testing.. [15:05:19] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.142 second response time [15:05:38] RECOVERY - Check size of conntrack table on mw1136 is OK: OK: nf_conntrack is 24 % full [15:05:48] thcipriani: looks good. go ahead. [15:05:49] RECOVERY - HHVM processes on mw1136 is OK: PROCS OK: 6 processes with command name hhvm [15:05:58] kart_: ack, doing. [15:06:09] RECOVERY - HHVM rendering on mw1136 is OK: HTTP OK: HTTP/1.1 200 OK - 72643 bytes in 0.372 second response time [15:06:22] Sorry for my lateness. I had some technical issues with my computer. [15:07:05] Are we SWATting? 
[15:07:09] PROBLEM - configured eth on mw1295 is CRITICAL: Connection refused by host [15:07:09] PROBLEM - configured eth on mw1296 is CRITICAL: Connection refused by host [15:07:18] PROBLEM - Apache HTTP on mw1296 is CRITICAL: Connection refused [15:07:18] PROBLEM - Apache HTTP on mw1295 is CRITICAL: Connection refused [15:07:29] Urbanecm: thcipriani actually deploys one patch [15:07:33] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:296180|Deploy Compact Language Links as default (Stage 3)]] (duration: 00m 40s) [15:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:07:39] PROBLEM - dhclient process on mw1296 is CRITICAL: Connection refused by host [15:07:39] PROBLEM - dhclient process on mw1295 is CRITICAL: Connection refused by host [15:07:39] Thx Luke [15:07:42] ^ kart_ check please [15:07:58] Urbanecm: yup, just got started, I left a comment on your throttle cleanup patch, FYI. [15:07:59] PROBLEM - mediawiki-installation DSH group on mw1296 is CRITICAL: Host mw1296 is not in mediawiki-installation dsh group [15:07:59] PROBLEM - mediawiki-installation DSH group on mw1295 is CRITICAL: Host mw1295 is not in mediawiki-installation dsh group [15:08:03] thcipriani: checking. [15:08:12] mw1296 is me, install in progress [15:08:15] thcipriani: I'm working on it... [15:08:18] PROBLEM - nutcracker port on mw1295 is CRITICAL: Connection refused by host [15:08:20] PROBLEM - nutcracker port on mw1296 is CRITICAL: Timeout while attempting connection [15:08:24] Urbanecm: thanks :) [15:08:29] PROBLEM - nutcracker process on mw1295 is CRITICAL: Connection refused by host [15:09:00] mw1295 and mw1296 are my doing. 
Same issue as elukey before [15:10:58] ACKNOWLEDGEMENT - Apache HTTP on mw1295 is CRITICAL: Connection refused Gehel initial configuration in progress [15:10:58] ACKNOWLEDGEMENT - Check size of conntrack table on mw1295 is CRITICAL: Connection refused by host Gehel initial configuration in progress [15:10:59] ACKNOWLEDGEMENT - DPKG on mw1295 is CRITICAL: Connection refused by host Gehel initial configuration in progress [15:10:59] ACKNOWLEDGEMENT - Disk space on mw1295 is CRITICAL: Connection refused by host Gehel initial configuration in progress [15:10:59] ACKNOWLEDGEMENT - MD RAID on mw1295 is CRITICAL: Connection refused by host Gehel initial configuration in progress [15:11:08] thcipriani: something is wrong. [15:11:21] thcipriani: I do not see it deployed on piwiki, towiki etc [15:12:09] (03PS2) 10Urbanecm: [cleanup] Delete old throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295744 [15:12:12] thcipriani: wikiquote is fine. [15:12:24] thcipriani: syned, https://gerrit.wikimedia.org/r/#/c/296180/4/dblists/clldefault.dblist? [15:13:18] (03PS1) 10RobH: adding services team to parsoid admins [puppet] - 10https://gerrit.wikimedia.org/r/296239 [15:13:19] kart_: blerg, sorry, doing now. [15:13:31] !log thcipriani@tin Synchronized dblists/clldefault.dblist: SWAT: [[gerrit:296180|Deploy Compact Language Links as default (Stage 3)]] PART II (duration: 00m 23s) [15:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:36] ^ kart_ check please [15:14:08] (03PS2) 10RobH: adding services team to parsoid admins [puppet] - 10https://gerrit.wikimedia.org/r/296239 [15:14:12] (03CR) 10Urbanecm: "Removed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295744 (owner: 10Urbanecm) [15:14:23] gehel: heh, you fast acknoliging errors :D [15:14:28] thcipriani: I've removed the another old rule. [15:14:29] thcipriani: now good! [15:14:33] thcipriani: Thanks. [15:14:47] Luke081515: I was waiting for them... 
[15:14:55] kart_: thanks for checking, my brain's running a bit slow this Monday :) [15:15:03] Urbanecm: thank you! [15:15:04] (03CR) 10jenkins-bot: [V: 04-1] adding services team to parsoid admins [puppet] - 10https://gerrit.wikimedia.org/r/296239 (owner: 10RobH) [15:15:12] You're welcome. [15:15:28] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [15:15:50] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295744 (owner: 10Urbanecm) [15:15:57] thcipriani: :) [15:16:08] !log banning elastic1001 to prepare its decommissioning (T138329) [15:16:09] T138329: Install and configure new elasticsearch servers in eqiad - https://phabricator.wikimedia.org/T138329 [15:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:16:18] hrmmm [15:16:22] i haz failure =[ [15:16:31] (03Merged) 10jenkins-bot: [cleanup] Delete old throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295744 (owner: 10Urbanecm) [15:16:37] 06Operations, 06Research-and-Data-Backlog, 10Research-management, 06Revision-Scoring-As-A-Service, and 3 others: [Epic] Deploy Revscoring/ORES service in Prod - https://phabricator.wikimedia.org/T106867#2409354 (10Halfak) Yeah. This was about getting ores.wikimedia.org online. Our plan is to keep ores.... 
[15:16:37] ahhh, typo [15:17:10] (03PS3) 10Thcipriani: Increase move rate limit for extendedmovers in enwiki to 16/60 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296077 (https://phabricator.wikimedia.org/T138703) (owner: 10Urbanecm) [15:17:13] (03PS3) 10RobH: adding services team to parsoid admins [puppet] - 10https://gerrit.wikimedia.org/r/296239 [15:17:35] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296077 (https://phabricator.wikimedia.org/T138703) (owner: 10Urbanecm) [15:17:49] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [15:18:07] (03Merged) 10jenkins-bot: Increase move rate limit for extendedmovers in enwiki to 16/60 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296077 (https://phabricator.wikimedia.org/T138703) (owner: 10Urbanecm) [15:19:31] !log thcipriani@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:295744|Delete old throttle rules]] (duration: 00m 26s) [15:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:19:46] ^ Urbanecm throttle rule cleanup sync'd (thanks for that :)) [15:19:48] (03CR) 10RobH: [C: 032] "This was approved via the operations team private discussion via email (since we lacked an ops meeting this week due to Wikimania.)" [puppet] - 10https://gerrit.wikimedia.org/r/296239 (owner: 10RobH) [15:19:55] Thx [15:21:49] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:296077|Increase move rate limit for extendedmovers in enwiki to 16/60]] (duration: 00m 28s) [15:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:22:03] marktraceur: just fyi https://wikitech.wikimedia.org/w/index.php?title=Deployments&action=historysubmit&type=revision&diff=690605&oldid=690504 [15:22:07] ^ Urbanecm move rate limit sync'd [15:22:12] THx [15:22:24] Urbanecm: thank you for the patches :) [15:22:35] 
You're welcome :) [15:23:27] (03CR) 10Ori.livneh: [C: 031] site: add prometheus[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/296205 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [15:31:12] 06Operations, 10Ops-Access-Requests, 06Parsing-Team, 06Services: Allow the Services team to administer the Parsoid cluster - https://phabricator.wikimedia.org/T137879#2409394 (10RobH) 05stalled>03Resolved a:05RobH>03None There is no operations team meeting this week due to Wikimania. Since this wa... [15:40:22] greg-g: Thanks, I had meant to do that for some time [15:41:05] (03PS2) 10Gehel: Configure Kartotherian geoshapes support [puppet] - 10https://gerrit.wikimedia.org/r/295602 (https://phabricator.wikimedia.org/T134084) (owner: 10Yurik) [15:42:18] (03PS3) 10Gehel: Prevent geoshape service use by production [puppet] - 10https://gerrit.wikimedia.org/r/295703 (owner: 10Yurik) [15:43:42] marktraceur: now you don't get your automatic IRC ping of "SF is startign to wake up now" :) [15:43:57] Heh [15:44:06] 10:43 AM here [15:44:06] greg-g: Mostly I'm fine with it, but I haven't done a deploy for months [15:44:17] Bsadowski1: Central represent. [15:44:53] Yeah, when I wake up, it's 5 AM in Cali [15:44:57] lol [15:45:01] (03CR) 10Gehel: [C: 032] Prevent geoshape service use by production [puppet] - 10https://gerrit.wikimedia.org/r/295703 (owner: 10Yurik) [15:45:18] Which means greg-g is waking up too! [15:45:35] And James_F (on a normal day) has been awake for an hour or so, probably [15:47:54] marktraceur: s/waking up/getting on computer/ # there are two kids in my bed, I don't sleep in much anymore :) [15:48:04] Indeed. 
[15:48:43] greg-g: dude I feel you so much :) [15:49:09] (03PS3) 10Gehel: Configure Kartotherian geoshapes support [puppet] - 10https://gerrit.wikimedia.org/r/295602 (https://phabricator.wikimedia.org/T134084) (owner: 10Yurik) [15:52:30] RECOVERY - Apache HTTP on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.011 second response time [15:55:01] (03PS1) 10Jhobs: Enable Wikibase descriptions on Catalan and Polish wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296242 [15:55:08] is the server lagging? [15:55:15] what the heck [15:55:25] (03PS2) 10Jhobs: Enable Wikibase descriptions on Catalan and Polish wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296242 (https://phabricator.wikimedia.org/T135429) [15:56:10] RECOVERY - dhclient process on mw1296 is OK: PROCS OK: 0 processes with command name dhclient [15:56:39] RECOVERY - nutcracker port on mw1296 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [15:56:41] RECOVERY - configured eth on mw1296 is OK: OK - interfaces up [15:56:56] (03PS1) 10Elukey: Add mw129[34] to the MediaWiki scap dsh list. [puppet] - 10https://gerrit.wikimedia.org/r/296244 [15:57:18] chasemp: and even when they sleep in, it's not like I can! Rowan was inverted in the bed, which meant I had no covers on me. [16:00:17] (03Abandoned) 10Elukey: Add mw129[34] to the MediaWiki scap dsh list. [puppet] - 10https://gerrit.wikimedia.org/r/296244 (owner: 10Elukey) [16:02:22] (03CR) 10Gehel: [C: 032] "Sufficient safeguards are in place (referer check), common understanding achieved with akosiaris." 
[puppet] - 10https://gerrit.wikimedia.org/r/295602 (https://phabricator.wikimedia.org/T134084) (owner: 10Yurik) [16:04:29] RECOVERY - Apache HTTP on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.002 second response time [16:07:00] PROBLEM - puppet last run on mw2210 is CRITICAL: CRITICAL: puppet fail [16:07:20] RECOVERY - nutcracker port on mw1295 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [16:07:29] (03PS1) 10Rush: tools: allow users to create unpriv chroot [puppet] - 10https://gerrit.wikimedia.org/r/296245 [16:07:40] RECOVERY - nutcracker process on mw1295 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [16:08:40] RECOVERY - configured eth on mw1295 is OK: OK - interfaces up [16:09:20] RECOVERY - dhclient process on mw1295 is OK: PROCS OK: 0 processes with command name dhclient [16:15:09] (03PS1) 10Elukey: Add mw129[34] to the MediaWiki DSH scap list. [puppet] - 10https://gerrit.wikimedia.org/r/296246 [16:16:06] (03CR) 10Elukey: [C: 032] Add mw129[34] to the MediaWiki DSH scap list. 
[puppet] - 10https://gerrit.wikimedia.org/r/296246 (owner: 10Elukey) [16:23:22] (03PS2) 10Filippo Giunchedi: site: add prometheus[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/296205 (https://phabricator.wikimedia.org/T126785) [16:23:28] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] site: add prometheus[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/296205 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [16:35:51] RECOVERY - puppet last run on mw2210 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:44:13] (03PS1) 10Filippo Giunchedi: prometheus: add init.pp class and expand documentation [puppet] - 10https://gerrit.wikimedia.org/r/296248 (https://phabricator.wikimedia.org/T126785) [16:47:31] 06Operations, 10Fundraising-Backlog: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2409671 (10CCogdill_WMF) Pinging again as July is almost upon us, which is when we hope to run the wikimedia.org vs wikipedia.org domain test! We plan to send the email o... [16:53:29] thcipriani twentyafterfour ostriches I've assumed the meeting today won't happen post-wikimania, confirm [Yn]? [16:54:30] godog: Continue? ([d]etailed/[C]oncise report,[y]es,[n]o,[r]etry): [16:54:39] godog: eh, I had a couple questions but mostly for twentyafterfour actually. I suppose that means there's no need for a meeting. [16:54:46] ori: :D [16:54:52] ori: ^C^C^C [16:54:57] "there is no escape" [16:55:26] godog: I'll update the calendar invite. [16:55:44] thcipriani: sounds great, thanks! [17:00:05] gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160627T1700). Please do the needful. 
[17:04:46] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/3205/prometheus2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/296248 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [17:06:18] (03CR) 10EBernhardson: "would it make sense to pull es-tool into its own repo?" [puppet] - 10https://gerrit.wikimedia.org/r/290765 (owner: 10Gehel) [17:06:25] SMalyshev: Nothing to deploy for WDQS? [17:13:09] (03CR) 10Gehel: "yes, it would most probably make sense to move it. I just need to understand what is required for that. And also how to deploy it once it " [puppet] - 10https://gerrit.wikimedia.org/r/290765 (owner: 10Gehel) [17:14:53] RECOVERY - mediawiki-installation DSH group on mw1294 is OK: OK [17:14:53] RECOVERY - mediawiki-installation DSH group on mw1293 is OK: OK [17:15:01] (03PS12) 10Gehel: logstash: Update logstash for sending to es 2.x [puppet] - 10https://gerrit.wikimedia.org/r/295578 (https://phabricator.wikimedia.org/T138335) (owner: 10EBernhardson) [17:21:10] (03CR) 10Gehel: [C: 032] logstash: Update logstash for sending to es 2.x [puppet] - 10https://gerrit.wikimedia.org/r/295578 (https://phabricator.wikimedia.org/T138335) (owner: 10EBernhardson) [17:23:22] !log deploying new logstash config for transition to elasticsearch 2.x (T138335) [17:23:23] T138335: Backport de_dot plugin for logstash to 1.5.3 - https://phabricator.wikimedia.org/T138335 [17:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:36:39] 06Operations, 10Traffic, 13Patch-For-Review: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2409739 (10Krinkle) >>! In T124954#2406470, @BBlack wrote: > Why don't we update IMS timestamp or ETag when cached parser output actually-changes [..] There is no detection of that k... 
[17:42:46] 06Operations, 10Traffic, 13Patch-For-Review: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2409767 (10Krinkle) [17:52:42] (03PS1) 10Cmjohnson: Adding mgmt dns entries for new memcache, also updating missing asset tags for mc1009-1018. [dns] - 10https://gerrit.wikimedia.org/r/296252 [17:56:06] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns entries for new memcache, also updating missing asset tags for mc1009-1018. [dns] - 10https://gerrit.wikimedia.org/r/296252 (owner: 10Cmjohnson) [18:17:34] (03PS8) 10EBernhardson: Duplicate logstash output to alternate elasticsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/295442 [18:17:51] (03PS1) 10Cmjohnson: Adding mgmt dns entrires for labvirt1012-14, labsdb1009-11 and ms-be1022-27. [dns] - 10https://gerrit.wikimedia.org/r/296254 [18:18:02] (03CR) 10Ori.livneh: [C: 031] prometheus: add init.pp class and expand documentation [puppet] - 10https://gerrit.wikimedia.org/r/296248 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [18:20:45] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns entrires for labvirt1012-14, labsdb1009-11 and ms-be1022-27. [dns] - 10https://gerrit.wikimedia.org/r/296254 (owner: 10Cmjohnson) [18:27:49] 06Operations, 10ops-eqiad: rack/setup/install/deploy labsdb1009-labsdb1011 - https://phabricator.wikimedia.org/T136860#2409829 (10Cmjohnson) labsdb1009 1H IN A 10.65.4.59 wmf4797 1H IN A 10.65.4.59 labsdb1010 1H IN A 10.65.4.60 wmf4798 1H IN A 10.65.4.60 labsdb1011 1H IN A 10.65.... [18:28:12] 06Operations, 10ops-eqiad, 10media-storage: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2409832 (10Cmjohnson) ms-be1022 1H IN A 10.65.4.65 wmf6970 1H IN A 10.65.4.65 ms-be1023 1H IN A 10.65.4.66 wmf6971 1H IN A 10.65.4.66 ms-be1024 1H IN A 10.... 
[18:28:27] 06Operations, 10ops-eqiad: rack/setup/install/deploy labvirt nodes - https://phabricator.wikimedia.org/T138509#2409833 (10Cmjohnson) labvirt1012 1H IN A 10.65.4.62 wmf6976 1H IN A 10.65.4.62 labvirt1013 1H IN A 10.65.4.63 wmf6977 1H IN A 10.65.4.63 labvirt1014 1H IN A 10.65.4.64 wmf... [18:40:38] PROBLEM - HP RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [18:43:45] (03PS1) 10Jdlrobson: Enable both performance experiments on small tagalog wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296257 (https://phabricator.wikimedia.org/T137822) [18:47:18] RECOVERY - HP RAID on ms-be2023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [18:47:47] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [19:00:44] (03PS1) 10Gehel: mediawiki: add mw1295/6 to mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/296260 [19:09:11] (03PS1) 10Gehel: Elasticsearch - Increase alerting threshold for alerts on codfw response time [puppet] - 10https://gerrit.wikimedia.org/r/296262 [19:10:19] (03CR) 10EBernhardson: [C: 031] Elasticsearch - Increase alerting threshold for alerts on codfw response time [puppet] - 10https://gerrit.wikimedia.org/r/296262 (owner: 10Gehel) [19:10:45] (03CR) 10Gehel: [C: 032] Elasticsearch - Increase alerting threshold for alerts on codfw response time [puppet] - 10https://gerrit.wikimedia.org/r/296262 (owner: 10Gehel) [19:15:36] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [19:18:52] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [19:31:31] 06Operations, 06Discovery, 06Maps: Icinga is randomly loosing connectivity to maps1002 - https://phabricator.wikimedia.org/T138782#2409974 
(10Gehel) [19:48:36] 06Operations, 10ops-codfw: codfw: return one intel ssd to dasher for warranty replacement - https://phabricator.wikimedia.org/T132210#2410017 (10Papaul) Tracking info for the returned SSD {F4206592} [19:52:06] 06Operations, 10ops-codfw, 06DC-Ops: Humidity Alarms from codfw - https://phabricator.wikimedia.org/T110421#2410020 (10Papaul) The panels are in place between Row A and B. {F4206607} @Rob if there is nothing else can we close this ticket? [19:58:34] (03PS9) 10EBernhardson: Duplicate logstash output to alternate elasticsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/295442 [20:00:04] gwicke, cscott, arlolra, subbu, bearND, and mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160627T2000). Please do the needful. [20:01:23] will deploy mobileapps in a bit [20:01:25] !log starting parsoid deploy [20:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:03:59] !log synced new parsoid code; restarted parsoid on wtp1001 as a canary [20:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:05:55] lgtm. restarting on all nodes. [20:08:15] !log finished deploying parsoid sha dd8e644d [20:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:08:20] time to verify ... [20:13:32] {{done}} [20:13:33] !log mobileapps deployed 30cc12e [20:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:16:01] 06Operations, 06Discovery, 06Maps: Icinga is randomly loosing connectivity to maps1002 - https://phabricator.wikimedia.org/T138782#2410088 (10Gehel) Some more investigation with @RobH: * ethtool reports link is at 1Go, no issue * icinga failures are spread across the whole day, so issue is probably not rela... 
[20:16:10] 06Operations, 06Discovery, 06Maps: Icinga is randomly loosing connectivity to maps1002 - https://phabricator.wikimedia.org/T138782#2410090 (10Gehel) a:03Cmjohnson [20:17:19] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:21:18] (03PS1) 10Dereckson: Namespace configuration for sk.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296270 (https://phabricator.wikimedia.org/T138779) [20:49:48] PROBLEM - puppet last run on mw1143 is CRITICAL: CRITICAL: Puppet has 10 failures [20:51:07] (03PS1) 10EBernhardson: Update kibana module for kibana 4 [puppet] - 10https://gerrit.wikimedia.org/r/296279 [20:52:29] (03CR) 10jenkins-bot: [V: 04-1] Update kibana module for kibana 4 [puppet] - 10https://gerrit.wikimedia.org/r/296279 (owner: 10EBernhardson) [21:02:51] (03PS2) 10EBernhardson: [WIP] Update kibana module for kibana 4 [puppet] - 10https://gerrit.wikimedia.org/r/296279 (https://phabricator.wikimedia.org/T138328) [21:04:36] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Update kibana module for kibana 4 [puppet] - 10https://gerrit.wikimedia.org/r/296279 (https://phabricator.wikimedia.org/T138328) (owner: 10EBernhardson) [21:10:35] (03PS3) 10EBernhardson: [WIP] Update kibana module for kibana 4 [puppet] - 10https://gerrit.wikimedia.org/r/296279 (https://phabricator.wikimedia.org/T138328) [21:14:07] 06Operations, 10Deployment-Systems, 03Scap3: setup automatic deletion of old l10nupdate - https://phabricator.wikimedia.org/T130317#2410301 (10greg) [21:14:35] 06Operations, 06Labs, 10Labs-Infrastructure, 10netops, and 2 others: Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2410303 (10greg) [21:16:16] 06Operations, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - 
https://phabricator.wikimedia.org/T73487#2410325 (10greg) [21:17:27] RECOVERY - puppet last run on mw1143 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [21:32:48] greg-g: just updated https://phabricator.wikimedia.org/T138585 to reflect fine with me to rescind the thing as blocking [21:33:56] greg-g: i believe i marked it as a blocker to get a handle on the general problem. [21:34:23] greg-g: but that's no longer the case - the resourceloader bug was fixed [21:36:23] MatmaRex: ^ note for you, too. sorry for causing confusion! [21:49:37] PROBLEM - puppet last run on mw1143 is CRITICAL: CRITICAL: puppet fail [21:53:47] no problem. thanks [21:59:35] (03PS1) 10Jhobs: Introduce variable for wikidata taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296345 (https://phabricator.wikimedia.org/T138738) [22:03:28] (03PS2) 10Jhobs: Introduce variable for wikidata taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296345 (https://phabricator.wikimedia.org/T138738) [22:10:11] (03CR) 10Jdlrobson: [C: 04-1] Introduce variable for wikidata taglines (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296345 (https://phabricator.wikimedia.org/T138738) (owner: 10Jhobs) [22:14:23] (03PS1) 10Jhobs: Introduce variable for wikidata taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296346 (https://phabricator.wikimedia.org/T138738) [22:14:50] (03CR) 10jenkins-bot: [V: 04-1] Introduce variable for wikidata taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296346 (https://phabricator.wikimedia.org/T138738) (owner: 10Jhobs) [22:15:51] (03Abandoned) 10Jhobs: Introduce variable for wikidata taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296345 (https://phabricator.wikimedia.org/T138738) (owner: 10Jhobs) [22:16:22] (03CR) 10Jhobs: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296346 (https://phabricator.wikimedia.org/T138738) (owner: 10Jhobs) [22:21:22] (03PS4) 
10EBernhardson: [WIP] Update kibana module for kibana 4 [puppet] - 10https://gerrit.wikimedia.org/r/296279 (https://phabricator.wikimedia.org/T129138) [22:30:27] (03PS5) 10EBernhardson: [WIP] Update kibana module for kibana 4 [puppet] - 10https://gerrit.wikimedia.org/r/296279 (https://phabricator.wikimedia.org/T129138) [22:32:47] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1500.0] [22:35:17] (03PS3) 10Jhobs: Enable Wikibase descriptions on Catalan and Polish wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296242 (https://phabricator.wikimedia.org/T135429) [22:35:42] (03CR) 10jenkins-bot: [V: 04-1] Enable Wikibase descriptions on Catalan and Polish wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296242 (https://phabricator.wikimedia.org/T135429) (owner: 10Jhobs) [22:35:58] (03PS1) 10Chad: WIP: Trying to automate generation of robots.txt files [puppet] - 10https://gerrit.wikimedia.org/r/296349 [22:37:12] (03CR) 10jenkins-bot: [V: 04-1] WIP: Trying to automate generation of robots.txt files [puppet] - 10https://gerrit.wikimedia.org/r/296349 (owner: 10Chad) [22:46:02] I have a problem with the patches I have scheduled for SWAT tonight. We had to do some rebasing to fix an urgent bug but I'm not sure I rebased it all properly, because the patches won't merge according to jenkins. Is anyone available to give me a hand by chance? [22:47:41] http://phabricator.wikimedia.net.ru/maniphest/ WTF? [22:48:03] Rrrrg if you click, don't think it's what it seems wwoooops! [22:48:22] (03CR) 10BryanDavis: [WIP] Update kibana module for kibana 4 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/296279 (https://phabricator.wikimedia.org/T129138) (owner: 10EBernhardson) [22:48:44] jhobs: you mean https://gerrit.wikimedia.org/r/296242? 
[22:48:45] It's like a scrape of wmf phab, with "ads" and who knows what else [22:49:16] Just arrived at it via a google search [22:49:35] Luke081515: yes, https://gerrit.wikimedia.org/r/#/c/296242/ depends on https://gerrit.wikimedia.org/r/#/c/296346/ which depends on https://gerrit.wikimedia.org/r/#/c/296281/ [22:49:52] jhobs: before PS3, you put these activations to wmgMFUseWikibaseDescription, now to wgMFDisplayWikibaseDescriptionsAsTaglines. expected? [22:50:00] yes [22:50:10] ok [22:50:12] hence why I needed to rebase them [22:50:52] let me see, if I can rebase it manually [22:50:54] but the mediawiki/extensions/MobileFrontend patch needs to be deployed first (obviously, as the parent dependency) [22:50:59] thanks Luke081515! [22:52:02] dapatrick: ^ note but I guess maybe don't click above link, seems possibly dangerous (sorry, shouldn't have blindly copied it here)... Also, hi! [22:53:29] (03PS2) 10Chad: WIP: Trying to automate generation of robots.txt files [puppet] - 10https://gerrit.wikimedia.org/r/296349 [22:54:32] AndyRussG Weird. I'll make a note of it. How did you find this? [22:54:49] dapatrick: just googling for a wikimedia-vagrant bug [22:55:14] I think we've seen that site before... [22:55:53] 06Operations, 10vm-requests, 13Patch-For-Review, 05Security: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650#2410607 (10dpatrick) Thanks @Dzahn! 
[22:56:06] dapatrick: bd808: here's the google search: https://www.google.ca/search?q=Could+not+find+gem+%27mediawiki-vagrant+(%3D+0.13.2)%27+in+any+of+the+gem+sources+listed+in+your+Gemfile+or+available+on+this+machine.&ie=utf-8 [22:56:54] actually brought me to a scrape of a legit and relevant Phab task [22:57:12] i don't think it's a scrape, it's live-proxying [22:57:28] jhobs: I'm wondering, the first patch merges FF for me [22:57:47] I going to overwrite it (unchanged), maybe jenkins will accept that then [22:57:54] (03CR) 10EBernhardson: [WIP] Update kibana module for kibana 4 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/296279 (https://phabricator.wikimedia.org/T129138) (owner: 10EBernhardson) [22:58:02] Luke081515: ok thanks. Yeah it all worked fine locally for me, so I was quite confused [22:58:21] huh [22:59:17] (03PS6) 10EBernhardson: [WIP] Update kibana module for kibana 4 [puppet] - 10https://gerrit.wikimedia.org/r/296279 (https://phabricator.wikimedia.org/T129138) [22:59:19] * AndyRussG resists seeing what the login button does.... [22:59:38] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: puppet fail [23:00:04] RoanKattouw, ostriches, Krenair, MaxSem, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160627T2300). [23:00:04] jhobs: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:27] ! [remote rejected] master -> refs/for/master (no new changes) [23:00:29] meh [23:00:38] :/ [23:00:47] let's try a workaround [23:01:07] (03CR) 10EBernhardson: "deployed to beta cluster (deployment-logstash2 has puppet disabled to allow this). 
Everything looks to be happy, minus the kibana .deb nee" [puppet] - 10https://gerrit.wikimedia.org/r/296279 (https://phabricator.wikimedia.org/T129138) (owner: 10EBernhardson) [23:01:12] Hi [23:01:17] (03PS2) 10Luke081515: Introduce variable for wikidata taglines [Temp to tell jenkins that it is mergeable] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296346 (https://phabricator.wikimedia.org/T138738) (owner: 10Jhobs) [23:01:25] I can SWAT this evening. [23:01:38] (03PS3) 10Luke081515: Introduce variable for wikidata taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296346 (https://phabricator.wikimedia.org/T138738) (owner: 10Jhobs) [23:01:42] (03CR) 10jenkins-bot: [V: 04-1] Introduce variable for wikidata taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296346 (https://phabricator.wikimedia.org/T138738) (owner: 10Jhobs) [23:01:47] Dereckson: we have actually a merge-problem :-/ [23:01:49] Dereckson: see conversation between Luke081515 and I for current state of my patches scheduled for SWAT [23:01:54] dapatrick: cool thx :) [23:01:56] jenkins tells that is not mergeable, but locally it does [23:02:01] (03CR) 10jenkins-bot: [V: 04-1] Introduce variable for wikidata taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296346 (https://phabricator.wikimedia.org/T138738) (owner: 10Jhobs) [23:02:35] Dereckson: maybe you can take a look, if you can fix it? [23:02:39] AndyRussG Sure thing. Would you like to be added to the Phab ticket? [23:02:44] sure, it's probably the dependency [23:02:53] AndyRussG I'm not sure there is anything we can do, but I'm creating one to track the issue. [23:02:56] you asked your patch to depends from Idb747699ebbba3e40100f697848cc10a980f1f0a [23:03:00] but there are two Idb747699ebbba3e40100f697848cc10a980f1f0a [23:03:19] yeah, i fucked up and put the change id on a patch by accident [23:03:25] it should be abandoned [23:03:31] so just remove the patch-id, or ? 
[23:03:35] Try to restore the abandoned patch [23:03:40] and change its patch id? [23:04:01] hm, that could work [23:04:09] that's what I did the first time and it created a separate patch but left the old one [23:04:13] dapatrick: sure! [23:04:13] (03Restored) 10Dereckson: Introduce variable for wikidata taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296345 (https://phabricator.wikimedia.org/T138738) (owner: 10Jhobs) [23:04:13] unless I just did it wrong? [23:04:37] well nope it won't work [23:04:49] Gerrit will create a new change if it's a new change ID, not delete this one [23:05:02] yeah, that's how we got https://gerrit.wikimedia.org/r/#/c/296346/ [23:05:17] (03Abandoned) 10Dereckson: Introduce variable for wikidata taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296345 (https://phabricator.wikimedia.org/T138738) (owner: 10Jhobs) [23:05:57] (03PS4) 10Dereckson: Introduce variable for wikidata taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296346 (https://phabricator.wikimedia.org/T138738) (owner: 10Jhobs) [23:06:27] Okay so PS4 here removes the Depends-On field, and manually states in the message the dependency for humans. [23:06:35] that's fine [23:06:51] as long as we SWAT the dependency first obviously [23:08:11] For 1.28.0-wmf.6, 1.28.0-wmf.7 or both? [23:08:23] both, it needs to go out now [23:08:28] fixes an urgent bug [23:09:39] Could not create a merge commit during the cherry pick [23:10:19] jhobs: I was able to cherry pick it for 1.28.0-wmf.7, but not for 1.28.0-wmf.6, that requires manual intervention [23:10:53] Dereckson: lemme check on the urgency. wmf.7 starts rolling out tomorrow, right? [23:11:02] double-check on the urgency* [23:11:06] yup [23:11:36] but it will only reach Wikipedia end of the week [23:11:55] Really? 
[23:12:03] I thought wmf.7 rolled out last week [23:12:19] * Dereckson checks the calendar [23:12:54] wikiversions.json shows wmf.7 everywhere except wikipedias [23:14:04] jhobs: so wikipedia will be wmf.8 (with your patch automatically in it) Thursday [23:14:56] Dereckson: no way to get it in prod today? A mobile feature depending on 2 of these 3 patches is broken [23:15:18] and I can't reach dr0ptp4kt for urgency evaluation [23:16:28] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [1000.0] [23:16:52] jhobs: yes it's possible, I'm preparing the port for wmf/1.28.0-wmf.6 [23:17:42] Dereckson: thanks! [23:19:04] jhobs: Could you in MobileFrontend repository do this? git checkout b8d4fabb58ad16537e4990c0be9c1bbc487aca9f -b wmf/1.28.0-wmf.6 ; git cherry-pick c8b1ad9ac0179 [23:19:23] There is a merge conflict you will probably more at ease than me to fix [23:19:30] Dereckson: one sec [23:24:02] (03PS3) 10Chad: WIP: Trying to automate generation of robots.txt files [puppet] - 10https://gerrit.wikimedia.org/r/296349 [23:24:35] Dereckson: I assume you want me to `git review` it also? [23:24:42] (just don't want to break anything) [23:24:45] jhobs: once the merge conflict is solved, you have locally the wmf/1.28.0-wmf.6 (b8d4fabb) with your new change at the top (a new hash, rebasing c8b1ad9ac at the top of b8d4fabb), that's exactly the state of the branch you want to have. You can push it to Gerrit with git push origin HEAD:refs/for/wmf/1.28.0-wmf.6/T138738. [23:25:37] You wouldn't break anything but git review would have done a refs/for/master/wmf/1.28.0-wmf.6, trying to add your change to master. [23:26:01] Here, you want your change to be for wmf/1.28.0-wmf.6. [23:26:13] So the full syntax is refs/for/<branch>/<topic> [23:26:44] hmm... getting an unauthorized error... 
[23:26:58] lemme double-check that i'm not just entering the password wrong or something stupid like that [23:27:16] perhaps you have several remotes, and gerrit is 'gerrit', not origin?. [23:27:18] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:27:23] (03PS4) 10Chad: WIP: Trying to automate generation of robots.txt files [puppet] - 10https://gerrit.wikimedia.org/r/296349 [23:27:41] If so it will be git push gerrit HEAD:refs/for/wmf/1.28.0-wmf.6/T138738 [23:28:04] Dereckson, why nor gerrit review wmf/1.28.0-wmf.6 ? [23:28:09] *not [23:28:13] Dereckson: thanks, that worked https://gerrit.wikimedia.org/r/296352 [23:28:51] You're welcome. [23:30:18] PROBLEM - puppet last run on ms-be2026 is CRITICAL: CRITICAL: Puppet has 1 failures [23:36:27] 06Operations: Create a simple puppet role for setting up a singlenode kubernetes install - https://phabricator.wikimedia.org/T138799#2410649 (10yuvipanda) [23:37:01] jhobs: fyi, we're waiting unit tests: https://integration.wikimedia.org/ci/job/mediawiki-extensions-php55/5242/console and https://integration.wikimedia.org/ci/job/mediawiki-extensions-php55/5243/console [23:37:41] Dereckson: that's cool, not surprised it's taking a while, we've got quite a few tests [23:39:06] Okay, they are merged. We can proceed. 
[23:39:22] (03PS5) 10Chad: WIP: Trying to automate generation of robots.txt files [puppet] - 10https://gerrit.wikimedia.org/r/296349 [23:39:26] 06Operations: Create a simple puppet role for setting up a singlenode kubernetes install - https://phabricator.wikimedia.org/T138799#2410663 (10yuvipanda) p:05Triage>03Low [23:40:33] (03CR) 10jenkins-bot: [V: 04-1] WIP: Trying to automate generation of robots.txt files [puppet] - 10https://gerrit.wikimedia.org/r/296349 (owner: 10Chad) [23:41:58] 06Operations: Create a simple puppet role for setting up a singlenode kubernetes install - https://phabricator.wikimedia.org/T138799#2410666 (10yuvipanda) A simple hyperkube package will also be useful for labs anyway, since we can then replace 6 different binaries with one hyperkube binary. [23:42:45] jhobs: so you confirm the deploy order is code first, config afterwards? [23:43:05] yes [23:43:15] and then obviously the configs have their dependency order [23:43:29] Dereckson: hi, can i add one more small patch to the swat? i'm creating the backport now. [23:43:36] MatmaRex: yes, go ahead [23:44:47] (03PS6) 10Chad: WIP: Trying to automate generation of robots.txt files [puppet] - 10https://gerrit.wikimedia.org/r/296349 [23:49:05] jhobs: okay, 'Introduce config variable to control tagline' is live on mw1017 [23:49:57] jhobs: can you test there with https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug? [23:51:59] Dereckson: checking... [23:52:08] Dereckson: my patch is https://gerrit.wikimedia.org/r/296357 (added it to https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160627T2300 too) [23:54:39] Dereckson: to be clear, you haven't done the config part yet, right? Just the code? 
[23:54:50] jhobs: right [23:55:03] jhobs: I can send the config too if you wish [23:55:36] (03Abandoned) 10Chad: WIP: Trying to automate generation of robots.txt files [puppet] - 10https://gerrit.wikimedia.org/r/296349 (owner: 10Chad) [23:55:37] RECOVERY - puppet last run on ms-be2026 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [23:55:46] Dereckson: yeah it'll be much easier to test that way I think. Right now I'm having to hunt in load.php. Site loads fine though, so I think we're good on the code [23:55:53] okay [23:56:02] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296346 (https://phabricator.wikimedia.org/T138738) (owner: 10Jhobs) [23:56:48] (03Merged) 10jenkins-bot: Introduce variable for wikidata taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296346 (https://phabricator.wikimedia.org/T138738) (owner: 10Jhobs) [23:58:03] jhobs: config on mw1017 too [23:58:59] Dereckson: wonderful! All looking great :) [23:59:31] Good. Thanks for testing [23:59:35] let's deploy all that now