[00:00:05] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160105T0000). Please do the needful.
[00:03:04] OK GUYS HANG ON. CRAZY SWAT AHEAD
[00:04:29] operations, ores, Icinga: change ores monitoring to avoid icinga reload on puppet runs - https://phabricator.wikimedia.org/T122830#1915488 (Dzahn)
[00:06:41] ostriches: Tsk. :-)
[00:07:15] James_F: Don't be jealous that I nabbed such a high-profile swat.
[00:08:02] ostriches: Let's just take the site down. Everyone's in SF at the conference, no-one needs the site, right?
[00:08:28] I was just gonna `sync-file README` in a tight loop until *something* blew up
[00:08:43] * James_F grins.
[00:08:51] sync-dir is more likely to break things.
[00:09:04] Yay partial deploy caching.
[00:09:46] `sync-dir .` would probably be fun
[00:09:56] All the code moving of scap without any of the checking :p
[00:10:07] * James_F shudders.
[00:10:30] no no, breaking the site is only cool when you do it whilst flying
[00:10:53] do we have a barnstar for that yet?
[00:11:07] Pfft, planes are easy now that they all have wifi
[00:11:35] What's more fun is breaking it when everyone who can fix it is within 30 seconds of you and you *don't* get anyone to notice. *THAT* takes skill.
[00:12:00] ostriches: what if Reedy does it once he becomes a pilot >.>
[00:12:27] Orbital site breakage or go home (planet).
[00:12:46] If I sync the code but Reedy flies into a dead zone with no reception: who broke the site?
[00:13:28] ostriches: spin Leah's blame wheel and choose?
[00:14:04] * twentyafterfour was a fan of the sleep-flag at deviantART.
[00:14:23] block all deploys while catching some zzzs (or a long flight)
[00:14:38] Is there an actual ticket for "scap should de-pool servers, update their code, then re-pool them"?
[00:14:58] Yerp
[00:15:08] Is it Declined?
:-)
[00:15:13] https://phabricator.wikimedia.org/T104352
[00:15:20] Oh, neat
[00:15:21] http://20after4.deviantart.com/badges/
[00:15:31] Also https://phabricator.wikimedia.org/T73212
[00:17:25] That's for restarting the cluster, but I guess if it was performant enough it could also be used for SWATs right?
[00:18:21] twentyafterfour: :-)
[00:18:23] James_F: eventually swat deploys will be atomic, for sure
[00:18:26] There won't be a sync-file/sync-dir
[00:18:39] You scap/deploy it all, or nothing
[00:18:54] * James_F wibbles in awe at the majesty of the future-portal's vision of the future.
[00:26:01] (PS1) Dzahn: icinga: add USER2 resource for /usr/local/lib/ [puppet] - https://gerrit.wikimedia.org/r/262417
[00:26:47] (PS2) Dzahn: icinga: add USER4 resource for /usr/local/lib/ [puppet] - https://gerrit.wikimedia.org/r/262417
[00:27:14] (PS3) Dzahn: icinga: add USER4 resource for /usr/local/lib/ [puppet] - https://gerrit.wikimedia.org/r/262417 (https://phabricator.wikimedia.org/T110893)
[00:28:29] (PS4) Dzahn: icinga: add USER4 resource for /usr/local/lib/ [puppet] - https://gerrit.wikimedia.org/r/262417 (https://phabricator.wikimedia.org/T110893)
[00:29:16] (PS5) Dzahn: icinga: add USER4 macro for /usr/local/lib/ [puppet] - https://gerrit.wikimedia.org/r/262417 (https://phabricator.wikimedia.org/T110893)
[00:29:27] (CR) Dzahn: [C: 2] icinga: add USER4 macro for /usr/local/lib/ [puppet] - https://gerrit.wikimedia.org/r/262417 (https://phabricator.wikimedia.org/T110893) (owner: Dzahn)
[00:31:49] James_F: if the current window isn't used I can add it now?
[00:32:02] Krinkle: Ask ostriches, he claimed the slot. :-)
[00:32:10] https://gerrit.wikimedia.org/r/#/c/260789/
[00:32:59] * ostriches mutters something about '"High Priority" means security and data loss'
[00:33:24] Krinkle: How much of a benefit is it?
[00:33:28] * p858snake marks ostriches as Unbreak Now!
[00:33:37] It just drops one RL module, right?
[00:33:59] p858snake: I've been broken forevs.
[00:34:10] James_F: Yes, and the synchronous Skin::getSkinNameMessages() fetch for every php request, and the failing messageblobstore fetch for every startup module request.
[00:34:35] Krinkle: Yeah, I suppose.
[00:37:22] Krinkle: {{approved}}
[01:01:12] operations, RevisionScoringAsAService, ores, Monitoring: Add monitoring to ORES workers - https://phabricator.wikimedia.org/T121656#1915538 (Dzahn) @halfak I noticed this: ``` curl ores.wmflabs.org/scores/testwiki/reverted/1234 Redirec...
[01:19:07] <grrrit-wm> (PS1) Dzahn: ores: monitor workers without service reloads [puppet] - https://gerrit.wikimedia.org/r/262418 (https://phabricator.wikimedia.org/T122830)
[01:24:27] <grrrit-wm> (PS2) Dzahn: ores: monitor workers without service reloads [puppet] - https://gerrit.wikimedia.org/r/262418 (https://phabricator.wikimedia.org/T122830)
[01:45:23] <grrrit-wm> (CR) Dzahn: "http://puppet-compiler.wmflabs.org/1558/neon.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/262418 (https://phabricator.wikimedia.org/T122830) (owner: Dzahn)
[01:45:39] <grrrit-wm> (PS3) Dzahn: ores: monitor workers without service reloads [puppet] - https://gerrit.wikimedia.org/r/262418 (https://phabricator.wikimedia.org/T122830)
[01:47:00] <icinga-wm> PROBLEM - Hadoop NodeManager on analytics1040 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[01:48:06] <wikibugs> operations, RevisionScoringAsAService, ores, Monitoring: Add monitoring to ORES workers - https://phabricator.wikimedia.org/T121656#1915557 (Dzahn) like [[ https://gerrit.wikimedia.org/r/#/c/262418/3/modules/nagios_common/files/check_commands/check_ores_workers | this ]], using `check_http .. -u "ht...
[01:55:11] <icinga-wm> RECOVERY - Hadoop NodeManager on analytics1040 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[01:59:30] <grrrit-wm> (PS2) Dzahn: dataset: move roles to modules/role/ [puppet] - https://gerrit.wikimedia.org/r/260940
[01:59:38] <grrrit-wm> (CR) Dzahn: [C: 2] "no changes http://puppet-compiler.wmflabs.org/1559/" [puppet] - https://gerrit.wikimedia.org/r/260940 (owner: Dzahn)
[02:05:52] <mutante> @seen ironholds
[02:05:52] <wm-bot> mutante: ironholds is in #wikimedia-research right now
[02:17:20] <grrrit-wm> (CR) Dzahn: "@ArielGlenn was there a specific reason to wait with this?" [puppet] - https://gerrit.wikimedia.org/r/253594 (https://phabricator.wikimedia.org/T118739) (owner: ArielGlenn)
[02:20:18] <icinga-wm> PROBLEM - Hadoop NodeManager on analytics1038 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[02:24:52] <logmsgbot> !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 10m 13s)
[02:24:58] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:26:02] <grrrit-wm> (PS4) Dzahn: Add mgmt DNS entries for pc200[4-6] Add production DNS entries for 200[4-6] Bug:T121879 [dns] - https://gerrit.wikimedia.org/r/260942 (https://phabricator.wikimedia.org/T121879) (owner: Papaul)
[02:28:37] <icinga-wm> RECOVERY - Hadoop NodeManager on analytics1038 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[02:31:47] <logmsgbot> !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Jan 5 02:31:46 UTC 2016 (duration 6m 54s)
[02:31:52] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:43:35] <grrrit-wm> (CR) Dzahn: [C: 2] Add mgmt DNS entries for pc200[4-6] Add production DNS entries for 200[4-6] Bug:T121879 [dns] -
https://gerrit.wikimedia.org/r/260942 (https://phabricator.wikimedia.org/T121879) (owner: Papaul)
[02:45:20] <grrrit-wm> (CR) Dzahn: "pc2004.codfw.wmnet has address 10.192.16.170" [dns] - https://gerrit.wikimedia.org/r/260942 (https://phabricator.wikimedia.org/T121879) (owner: Papaul)
[03:07:09] <grrrit-wm> (PS2) Dzahn: Update wikimania redirects to 2016 [puppet] - https://gerrit.wikimedia.org/r/260593 (https://phabricator.wikimedia.org/T122207) (owner: Alex Monk)
[03:33:46] <icinga-wm> PROBLEM - puppet last run on mw2166 is CRITICAL: CRITICAL: puppet fail
[03:49:39] <logmsgbot> !log krinkle@tin Synchronized php-1.27.0-wmf.9/resources/Resources.php: Idaacf71870 (duration: 00m 36s)
[03:49:44] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:50:24] <logmsgbot> !log krinkle@tin Synchronized php-1.27.0-wmf.9/resources/src/mediawiki.special/: Idaacf71870 (duration: 00m 30s)
[03:50:29] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:51:03] <logmsgbot> !log krinkle@tin Synchronized php-1.27.0-wmf.9/includes/specials/SpecialJavaScriptTest.php: Idaacf71870 (duration: 00m 30s)
[03:51:07] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:58:49] <wikibugs> operations, Deployment-Systems, Performance-Team, Release-Engineering-Team, HHVM: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#1915627 (Krinkle)
[03:59:17] <icinga-wm> PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 34.78% of data above the critical threshold [100000000.0]
[04:00:02] <Krinkle> ostriches: {{done}}
[04:00:27] <icinga-wm> RECOVERY - puppet last run on mw2166 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:00:43] <ostriches> Krinkle: Okie dokie
[04:00:48] <Krinkle> James_F|Away: After the skin-* messages
that were broken (and now fixed) the most common ones are mmv (70%) and flow (20%)
[04:00:52] <Krinkle> https://logstash.wikimedia.org/#/dashboard/elasticsearch/resourceloader
[04:02:34] <icinga-wm> PROBLEM - puppet last run on wtp2019 is CRITICAL: CRITICAL: puppet fail
[04:22:53] <icinga-wm> RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[04:29:04] <icinga-wm> RECOVERY - puppet last run on wtp2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:47:44] <grrrit-wm> (PS1) Lokal Profil: Support multiple compression formats for dumps [dumps/dcat] - https://gerrit.wikimedia.org/r/262422 (https://phabricator.wikimedia.org/T118397)
[04:53:43] <grrrit-wm> (PS1) Lokal Profil: Support bzip2 compression format [puppet] - https://gerrit.wikimedia.org/r/262423 (https://phabricator.wikimedia.org/T118397)
[04:57:14] <grrrit-wm> (CR) Lokal Profil: "The corresponding config.json change is in I8eb0077d60afbcc87663e3bdd41c5f04d959d330" [dumps/dcat] - https://gerrit.wikimedia.org/r/262422 (https://phabricator.wikimedia.org/T118397) (owner: Lokal Profil)
[05:50:06] <icinga-wm> PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:50:36] <icinga-wm> PROBLEM - Disk space on analytics1027 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:50:46] <icinga-wm> PROBLEM - Disk space on analytics1026 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:51:35] <icinga-wm> PROBLEM - At least one Hadoop HDFS NameNode is active on analytics1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:20:37] <icinga-wm> PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 30.77% of data above the critical threshold [100000000.0]
[06:29:57] <icinga-wm> PROBLEM - puppet last run on mw2136 is CRITICAL: CRITICAL: puppet fail
[06:30:57] <icinga-wm> PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:17] <icinga-wm> PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:26] <icinga-wm> PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:37] <icinga-wm> PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:34:13] <icinga-wm> PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:36:44] <icinga-wm> PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: puppet fail
[06:45:13] <icinga-wm> RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[06:45:14] <icinga-wm> PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: puppet fail
[06:57:23] <icinga-wm> RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:53] <icinga-wm> RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:57:54] <icinga-wm> RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:03] <icinga-wm> RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:53] <icinga-wm> RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:55] <icinga-wm> RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:11:38] <icinga-wm> PROBLEM - Ulsfo HTTP 5xx
reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[08:15:38] <icinga-wm> PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[08:19:19] <icinga-wm> PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 13.64% of data above the critical threshold [100000000.0]
[08:19:39] <icinga-wm> RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:19:49] <icinga-wm> RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:23:49] <icinga-wm> PROBLEM - puppet last run on mw2098 is CRITICAL: CRITICAL: puppet fail
[08:50:15] <icinga-wm> RECOVERY - puppet last run on mw2098 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[08:53:36] <icinga-wm> RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[09:05:15] <icinga-wm> PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 619
[09:05:15] <icinga-wm> PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 619
[09:15:15] <icinga-wm> RECOVERY - check_mysql on db1008 is OK: Uptime: 1269501 Threads: 2 Questions: 40559990 Slow queries: 14742 Opens: 59209 Flush tables: 2 Open tables: 415 Queries per second avg: 31.949 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:20:15] <icinga-wm> RECOVERY - check_mysql on lutetium is OK: Uptime: 51544 Threads: 1 Questions: 252497 Slow queries: 358 Opens: 3708 Flush tables: 2 Open tables: 64 Queries per second avg: 4.898 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:54:46] <icinga-wm> PROBLEM - salt-minion processes on mira is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python
/usr/bin/salt-minion
[10:04:11] <wikibugs> operations, Continuous-Integration-Infrastructure: Test mwext-qunit-composer database disk image is malformed - https://phabricator.wikimedia.org/T122599#1915834 (Paladox) Open>declined a:Paladox Declining since the problem seems to not happen any more.
[10:08:19] <grrrit-wm> (CR) Paladox: "check experimental" [mediawiki-config] - https://gerrit.wikimedia.org/r/189148 (https://phabricator.wikimedia.org/T85947) (owner: Legoktm)
[10:18:54] <icinga-wm> RECOVERY - salt-minion processes on mira is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:24:12] <grrrit-wm> (CR) Paladox: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/189148 (https://phabricator.wikimedia.org/T85947) (owner: Legoktm)
[10:28:50] <grrrit-wm> (CR) Paladox: "We probably need to also require event-schemas repo." [mediawiki-config] - https://gerrit.wikimedia.org/r/189148 (https://phabricator.wikimedia.org/T85947) (owner: Legoktm)
[10:30:25] <ezachte> pls assist, can't access stat1002:/mnt/ (I hope this is right channel to ask, I always forget, too few ops problems ;)
[10:31:43] <ezachte> du /mnt hangs
[10:46:37] <grrrit-wm> (CR) Hashar: [C: 1] RuboCop: fixed Style/CommandLiteral offense [puppet] - https://gerrit.wikimedia.org/r/259706 (https://phabricator.wikimedia.org/T112651) (owner: Zfilipin)
[10:46:57] <grrrit-wm> (CR) Hashar: [C: 1] RuboCop: Fixed Style/DefWithParentheses offence [puppet] - https://gerrit.wikimedia.org/r/259708 (https://phabricator.wikimedia.org/T112651) (owner: Zfilipin)
[10:47:23] <grrrit-wm> (CR) Hashar: [C: 1] RuboCop: fixed Style/EmptyLiteral offense [puppet] - https://gerrit.wikimedia.org/r/259710 (https://phabricator.wikimedia.org/T112651) (owner: Zfilipin)
[10:48:17] <grrrit-wm> (CR) Hashar: [C: 1] RuboCop: fixed Style/DotPosition offense [puppet] - https://gerrit.wikimedia.org/r/259712
(https://phabricator.wikimedia.org/T112651) (owner: Zfilipin)
[10:51:38] <grrrit-wm> (CR) Hashar: [C: -1] RuboCop: fixed Style/LeadingCommentSpace offense (1 comment) [puppet] - https://gerrit.wikimedia.org/r/259717 (https://phabricator.wikimedia.org/T112651) (owner: Zfilipin)
[10:51:51] <grrrit-wm> (CR) Hashar: [C: 1] RuboCop: fixed Style/MethodCallParentheses offense [puppet] - https://gerrit.wikimedia.org/r/259718 (https://phabricator.wikimedia.org/T112651) (owner: Zfilipin)
[10:52:14] <grrrit-wm> (CR) Hashar: [C: 1] RuboCop: fixed Style/MultilineIfThen offense [puppet] - https://gerrit.wikimedia.org/r/259719 (https://phabricator.wikimedia.org/T112651) (owner: Zfilipin)
[10:54:46] <grrrit-wm> (CR) Hashar: [C: -1] "I would not enforce usage of unless. I find it often confusing specially with conditions such as:" [puppet] - https://gerrit.wikimedia.org/r/259722 (https://phabricator.wikimedia.org/T112651) (owner: Zfilipin)
[10:55:32] <grrrit-wm> (CR) Hashar: [C: -1] "We should just ignore this rule. Looks too confusing to me." [puppet] - https://gerrit.wikimedia.org/r/259725 (https://phabricator.wikimedia.org/T112651) (owner: Zfilipin)
[10:55:54] <grrrit-wm> (CR) Hashar: [C: 1] RuboCop: fixed Style/ParallelAssignment offense [puppet] - https://gerrit.wikimedia.org/r/259726 (https://phabricator.wikimedia.org/T112651) (owner: Zfilipin)
[10:56:48] <grrrit-wm> (CR) Hashar: [C: 1] "I don't mind not :-) So no I have no opinion about this rule."
[puppet] - https://gerrit.wikimedia.org/r/259724 (https://phabricator.wikimedia.org/T112651) (owner: Zfilipin)
[11:00:29] <grrrit-wm> (Abandoned) Hashar: tox integration to run flake8 [dumps] - https://gerrit.wikimedia.org/r/242494 (https://phabricator.wikimedia.org/T55354) (owner: Hashar)
[12:18:54] <icinga-wm> PROBLEM - Hadoop Namenode - Primary on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode
[12:26:29] <icinga-wm> PROBLEM - puppet last run on analytics1026 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:30:39] <icinga-wm> PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:42:33] <icinga-wm> RECOVERY - Hadoop Namenode - Primary on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode
[13:03:13] <grrrit-wm> (PS2) Hashar: contint: drop publish-console [puppet] - https://gerrit.wikimedia.org/r/260190 (owner: Dzahn)
[13:03:53] <grrrit-wm> (CR) Hashar: [C: 1] "The publish console has never been used. So I amended this change to remove the related puppet stuff." [puppet] - https://gerrit.wikimedia.org/r/260190 (owner: Dzahn)
[13:23:31] <icinga-wm> PROBLEM - PyBal backends health check on lvs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:25:22] <icinga-wm> RECOVERY - PyBal backends health check on lvs1001 is OK: PYBAL OK - All pools are healthy
[14:03:25] <grrrit-wm> (PS4) Hashar: tox entry point to run pep8==1.4.6 [puppet] - https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887)
[14:03:39] <grrrit-wm> (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: Hashar)
[14:05:14] <grrrit-wm> (CR) jenkins-bot: [V: -1] tox entry point to run pep8==1.4.6 [puppet] - https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: Hashar)
[14:23:47] <wikibugs> operations, Traffic: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#1916092 (BBlack) NEW
[14:26:29] <wikibugs> operations, Traffic: Install XKey vmod - https://phabricator.wikimedia.org/T122881#1916107 (BBlack) NEW
[14:26:45] <wikibugs> operations, Traffic: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#1916115 (BBlack)
[14:26:45] <wikibugs> operations, Traffic: Install XKey vmod - https://phabricator.wikimedia.org/T122881#1916114 (BBlack)
[14:27:03] <wikibugs> operations, Traffic: Install XKey vmod - https://phabricator.wikimedia.org/T122881#1916107 (BBlack)
[14:27:04] <wikibugs> operations, Traffic: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#1916092 (BBlack)
[14:49:37] <icinga-wm> PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[14:50:57] <icinga-wm> PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0]
[15:06:10] <icinga-wm> RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:06:30] <icinga-wm> RECOVERY - Ulsfo HTTP
5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:32:31] <wikibugs> operations, netops: Orange S.A. searches a contact at WMF for tests - https://phabricator.wikimedia.org/T122293#1916177 (Sylvain_WMFr) So, noc@wikimedia.org?
[15:42:29] <grrrit-wm> (PS1) Muehlenhoff: Add maintainer scripts for 3.19 kernels [debs/linux] - https://gerrit.wikimedia.org/r/262455 (https://phabricator.wikimedia.org/T122284)
[15:47:04] <icinga-wm> RECOVERY - At least one Hadoop HDFS NameNode is active on analytics1001 is OK: Hadoop Active NameNode OKAY: analytics1001-eqiad-wmnet
[15:47:17] <ottomata> !log transitioned analytics1001 to active namenode
[15:47:22] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:47:25] <icinga-wm> RECOVERY - Disk space on stat1002 is OK: DISK OK
[15:47:36] <icinga-wm> RECOVERY - Disk space on analytics1027 is OK: DISK OK
[15:47:44] <icinga-wm> RECOVERY - Disk space on analytics1026 is OK: DISK OK
[15:52:25] <icinga-wm> RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:54:03] <grrrit-wm> (CR) Muehlenhoff: [C: 2 V: 2] Add maintainer scripts for 3.19 kernels [debs/linux] - https://gerrit.wikimedia.org/r/262455 (https://phabricator.wikimedia.org/T122284) (owner: Muehlenhoff)
[15:58:35] <icinga-wm> RECOVERY - puppet last run on analytics1026 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[15:59:39] <grrrit-wm> (PS1) Muehlenhoff: Update to 3.19.8-ckt10 [debs/linux] - https://gerrit.wikimedia.org/r/262457
[16:00:05] <jouncebot> anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160105T1600). Please do the needful.
[16:04:49] <icinga-wm> RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:12:29] <icinga-wm> RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[16:24:09] <icinga-wm> PROBLEM - Hadoop NodeManager on analytics1030 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[16:25:29] <wikibugs> operations, ops-codfw, netops: patch/implement new zayo wave (579171) codfw-ulsfo cr1-codfw:xe-5/0/2 - https://phabricator.wikimedia.org/T122823#1916225 (Papaul) a:Papaul>None cable serial # 11541
[16:26:04] <wikibugs> operations, ops-codfw, netops: patch/implement new zayo wave (579171) codfw-ulsfo cr1-codfw:xe-5/0/2 - https://phabricator.wikimedia.org/T122823#1916227 (Papaul) a:faidon
[16:47:26] <icinga-wm> RECOVERY - Hadoop NodeManager on analytics1030 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[16:48:53] <grrrit-wm> (PS2) Hashar: ci: remove elasticsearch from browsertest slaves [puppet] - https://gerrit.wikimedia.org/r/259301 (https://phabricator.wikimedia.org/T89083) (owner: Chad)
[16:51:49] <grrrit-wm> (CR) Hashar: [C: 1] "Cherry picked on CI puppet master. Did the cleanup via salt." [puppet] - https://gerrit.wikimedia.org/r/259301 (https://phabricator.wikimedia.org/T89083) (owner: Chad)
[17:00:04] <jouncebot> Deploy window Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160105T1700)
[17:10:10] <Coren> Gaaah! Not being able to switch languages when not logged on is a nightmare!
[17:14:39] * Coren beats his head against mediawiki language selection.
[17:23:23] <icinga-wm> PROBLEM - puppet last run on mw2107 is CRITICAL: CRITICAL: puppet fail
[17:23:46] <Nemo_bis> Coren: award a token to https://phabricator.wikimedia.org/T58464 :)
[17:31:12] * Coren attempts to figure out what the commons hack is and how he could use it as a stopgap.
[17:36:55] <wikibugs> operations, ops-codfw, netops: patch/implement new zayo wave (579171) codfw-ulsfo cr1-codfw:xe-5/0/2 - https://phabricator.wikimedia.org/T122823#1916290 (RobH) a:faidon>None
[17:37:16] <wikibugs> operations, ops-codfw, netops: patch/implement new zayo wave (579171) codfw-ulsfo cr1-codfw:xe-5/0/2 - https://phabricator.wikimedia.org/T122823#1916295 (RobH) Open>Resolved a:RobH this is resolving, and I'll create a netops task for the implementation for this link (once both sides are compl...
[17:40:18] <wikibugs> operations, netops: turn up/implement new zayo wave (579171) for ulsfo-codfw - https://phabricator.wikimedia.org/T122885#1916306 (RobH) NEW
[17:41:46] <wikibugs> operations, netops: turn up/implement new zayo wave (579171) for ulsfo-codfw - https://phabricator.wikimedia.org/T122885#1916306 (RobH)
[17:42:00] <wikibugs> operations, netops: turn up/implement new zayo wave (579171) for ulsfo-codfw - https://phabricator.wikimedia.org/T122885#1916306 (RobH)
[17:42:51] <wikibugs> operations, ops-codfw, netops: patch/implement new zayo wave (579171) codfw-ulsfo cr1-codfw:xe-5/0/2 - https://phabricator.wikimedia.org/T122823#1916324 (RobH)
[17:42:53] <wikibugs> operations, netops: turn up/implement new zayo wave (579171) for ulsfo-codfw - https://phabricator.wikimedia.org/T122885#1916323 (RobH)
[17:43:50] <Nemo_bis> Coren: https://commons.wikimedia.org/wiki/MediaWiki:AnonymousI18N.js
[17:43:57] <Nemo_bis> And yes, it's used on multiple wikis.
[17:44:32] <Nemo_bis> I guess sysops need to enable it on a few hundred wikis before ops realise the cat is out of the bag. ;)
[17:44:46] <Coren> Goodie.
That saves my bacon for wikimania2017wiki
[17:45:03] <Coren> (I was already following the uselang= tracks)
[17:45:10] <wikibugs> operations, netops: turn-up/implement zayo wave (579171) for ulsfo-codfw - https://phabricator.wikimedia.org/T122885#1916330 (RobH) a:faidon
[17:49:02] <Coren> OMG referrerWikiUselang(). Hack is right (though afaict, this'll do the right thing 99% of the time)
[17:52:14] <icinga-wm> RECOVERY - puppet last run on mw2107 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:12:25] <wikibugs> operations, netops: Orange S.A. searches a contact at WMF for tests - https://phabricator.wikimedia.org/T122293#1916362 (Dzahn) >>! In T122293#1916177, @Sylvain_WMFr wrote: > So, noc@wikimedia.org? Yes, that is correct, it would also work @wikipedia.org but get forwarded to wikimedia.
[18:15:12] <grrrit-wm> (PS2) Dzahn: Add amire80 to statistics-users for quering mysql analytics-slave [puppet] - https://gerrit.wikimedia.org/r/261217 (https://phabricator.wikimedia.org/T122524) (owner: Jcrespo)
[18:18:04] <grrrit-wm> (CR) Dzahn: [C: 2] "jcrespo's -2 was based on missing approval and waiting period. both are ok now, so, i've removed it and going ahead to resolve this" [puppet] - https://gerrit.wikimedia.org/r/261217 (https://phabricator.wikimedia.org/T122524) (owner: Jcrespo)
[18:22:15] <mutante> @seen halfak
[18:22:15] <wm-bot> mutante: Last time I saw halfak they were quitting the network with reason: Ping timeout: 240 seconds N/A at 1/5/2016 5:07:26 PM (1h14m48s ago)
[18:26:12] <YuviPanda> mutante: he's doing a session right now at wmds
[18:27:51] <wikibugs> Ops-Access-Requests, operations, Analytics, ContentTranslation-Analytics, and 2 others: access for amire80 to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T122524#1916377 (Dzahn) >>! In T122524#1906673, @Nuria wrote: > I think Amire80 might alreday have acess to 1002, i seem to remembe...
[18:28:35] <mutante> YuviPanda: ah, thanks
[18:28:46] <Coren> Nemo_bis: Almost works. Special:MyLanguage gets confused though.
[18:28:48] * Coren digs further.
[18:29:07] <Nemo_bis> Coren: perhaps you need to configure Translate https://www.mediawiki.org/wiki/Help:Extension:Translate/Configuration#Page_translation_feature
[18:33:05] <Coren> Nemo_bis: Ah, no, it's just being smarter than I expected by not trying to show a translation that does not exist yet.
[18:33:47] <wikibugs> operations, netops: Orange S.A. searches a contact at WMF for tests - https://phabricator.wikimedia.org/T122293#1916389 (faidon) >>! In T122293#1916177, @Sylvain_WMFr wrote: > So, noc@wikimedia.org? Yes. Feel free to give them my name/email (faidon@) as well if you want to make it more personal.
[18:34:10] <logmsgbot> !log jzerebecki@tin Started scap: deploy-log
[18:34:14] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:34:15] <logmsgbot> !log jzerebecki@tin scap aborted: deploy-log (duration: 00m 04s)
[18:34:19] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:35:51] <jzerebecki> sorry for that
[18:36:02] <grrrit-wm> (PS1) Papaul: Create raid0-lvm-srv.cfg partman file Bug:T121879 [puppet] - https://gerrit.wikimedia.org/r/262518 (https://phabricator.wikimedia.org/T121879)
[18:40:17] <wikibugs> Ops-Access-Requests, operations, Analytics, ContentTranslation-Analytics, and 2 others: access for amire80 to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T122524#1916417 (Dzahn) amire80 has been added to the group statistics-users on stat100**3** [stat1003:~] $ id amire80 uid=2076(am...
[18:44:51] <grrrit-wm> (PS3) Dzahn: Update wikimania redirects to 2016 [puppet] - https://gerrit.wikimedia.org/r/260593 (https://phabricator.wikimedia.org/T122207) (owner: Alex Monk)
[18:45:32] <grrrit-wm> (CR) Dzahn: [C: 2] "well, it's 2016 now.
changes just the number 5 to 6" [puppet] - https://gerrit.wikimedia.org/r/260593 (https://phabricator.wikimedia.org/T122207) (owner: Alex Monk)
[18:47:21] <Robh> mutante: i wondered when that was going to happen
[18:49:39] <mutante> yea, there was no specific plan, i guess technically New Year's Eve but that was bad :)
[18:50:34] <mutante> i remember the year before we had the same update
[18:51:57] <grrrit-wm> (PS2) Papaul: Create raid0-lvm-srv.cfg partman file Bug:T121879: Change-Id: Iccc6b8be5b1e8b06b1f6787463a8cdee72d101cc [puppet] - https://gerrit.wikimedia.org/r/262518
[18:55:11] <mutante> "when wikimania2017 is installed already, it's time to switch wikimania.org to wikimania2016"
[18:55:51] <grrrit-wm> (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/260593 (https://phabricator.wikimedia.org/T122207) (owner: Alex Monk)
[18:57:45] <grrrit-wm> (PS3) Papaul: :Create raid0-lvm-srv.cfg partman file Bug:T121879: [puppet] - https://gerrit.wikimedia.org/r/262518
[19:00:30] <ostriches> mutante, Krenair: Bonus points if we could use something like ${ENV:YEAR} :p
[19:03:40] <ostriches> %{TIME_YEAR}?
[19:05:44] <mutante> i'm not sure since this is the thing where .conf gets generated from .dat
[19:08:21] <mutante> jenkins,plz
[19:13:36] <wikibugs> operations, ops-codfw, Patch-For-Review: return pollux to spares - https://phabricator.wikimedia.org/T117423#1916561 (Papaul)
[19:15:02] <wikibugs> operations, ops-codfw, Patch-For-Review: return pollux to spares - https://phabricator.wikimedia.org/T117423#1916562 (Papaul) Open>Resolved This is complete
[19:15:20] <ostriches> mutante: We could tweak the script :p
[19:16:09] <victorbarbu_> hi
[19:18:44] <victorbarbu_> anyone around here?
[19:18:49] <Krenair> yes
[19:19:19] <victorbarbu_> I guess you're not a server guy, since we've just talked earlier, are you?
[19:19:19] <victorbarbu_> :D
[19:25:16] <myrcx> victorbarbu_ hullo?
[19:27:42] <mutante> !ask [19:27:42] <wm-bot> Hi, how can we help you? Just ask your question. [19:28:00] <grrrit-wm> (03CR) 10Muehlenhoff: [C: 032 V: 032] Update to 3.19.8-ckt10 [debs/linux] - 10https://gerrit.wikimedia.org/r/262457 (owner: 10Muehlenhoff) [19:31:33] <grrrit-wm> (03PS4) 10Papaul: :Create raid0-lvm-srv.cfg partman file Bug:T121879: [puppet] - 10https://gerrit.wikimedia.org/r/262518 [19:34:29] <victorbarbu_> My problem is not wm related, but it's still servers matter and I can't find anyone to help me [19:34:40] <victorbarbu_> I've installed nginx 1.4.6 and hhvm 3.11 on my ubuntu 14.04 x64 [19:35:25] <victorbarbu_> the problem is that, when I browse to localhost, I get 504 (gateway timeout) and then 502 [19:35:34] <victorbarbu_> and I've read the error logs and I found this [19:36:13] <victorbarbu_> https://dpaste.de/5mrX [19:36:20] <myrcx> victorbarbu_: not trying to boot you off elsewhere, but #nginx exists :3 [19:36:20] <victorbarbu_> I simply don't know what to do to solve it [19:36:28] <victorbarbu_> I tried [19:36:34] <victorbarbu_> they said they don't know anything about hhvm [19:36:43] <victorbarbu_> since wm uses hhvm [19:36:57] <victorbarbu_> and here may be server specialists... [19:37:02] <myrcx> fair enough :) I'll have a look at the paste [19:37:03] <victorbarbu_> I thought it's my last chance [19:37:43] <mutante> victorbarbu_: have you tried restarting the hhvm service yet [19:37:45] <victorbarbu_> thank you, myrcx [19:37:59] <victorbarbu_> mutante, like 10 times [19:38:49] <grrrit-wm> (03PS1) 10Papaul: Adding install params for pc200[4-6] Bug:T121879 [puppet] - 10https://gerrit.wikimedia.org/r/262587 (https://phabricator.wikimedia.org/T121879) [19:40:30] <mutante> victorbarbu_: what do you get for "file /var/run/hhvm/hhvm.sock"? does it exist? [19:40:55] <myrcx> victorbarbu_: what's your fastcgi_read_timeout set to?
[19:41:04] <victorbarbu_> it exists, yes [19:42:29] <victorbarbu_> myrcx, it's not anywhere in the configuration files of nginx [19:43:17] <grrrit-wm> (03CR) 10Hoo man: "Fine to deploy without the referenced dcat change (manually verified)." [puppet] - 10https://gerrit.wikimedia.org/r/262423 (https://phabricator.wikimedia.org/T118397) (owner: 10Lokal Profil) [19:43:51] <myrcx> add it to your location for hhvm [19:45:31] <victorbarbu_> value? [19:46:08] <wikibugs> 7Puppet, 6operations: puppet compiler runs fail when backup::host is included on host - https://phabricator.wikimedia.org/T122909#1916608 (10Dzahn) 3NEW [19:46:09] <myrcx> eh, go for 200 :P [19:46:24] <victorbarbu_> is that seconds? [19:46:55] <wikibugs> 7Puppet, 6operations: puppet compiler runs fail when backup::host is included on host - https://phabricator.wikimedia.org/T122909#1916615 (10Dzahn) [19:47:30] <myrcx> yeah, so: fastcgi_read_timeout 200; [19:50:41] <grrrit-wm> (03PS3) 10Dzahn: contint: drop publish-console [puppet] - 10https://gerrit.wikimedia.org/r/260190 [19:50:50] <grrrit-wm> (03CR) 10Dzahn: [C: 032] contint: drop publish-console [puppet] - 10https://gerrit.wikimedia.org/r/260190 (owner: 10Dzahn) [19:53:26] <victorbarbu_> doesn't seem to do anything [19:53:43] <myrcx> bugger - same error? [19:54:36] <victorbarbu_> not yet, but it's probably going to take 200 seconds to give me 504 [19:55:14] <mutante> how about /var/log/hhvm/ ? any error. or stacktrace.log files there? [19:58:24] <victorbarbu_> nothing related to server [19:58:34] <victorbarbu_> it's empty [19:59:36] <mutante> hmm. can you add config options to make sure it's logging there and maybe up the verbosity [19:59:45] <mutante> victorbarbu_: maybe it logs to /var/log/syslog ? 
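The socket check suggested above (does /var/run/hhvm/hhvm.sock exist, and is anything actually accepting connections on it?) can be sketched in a few lines of Python. This is only an illustrative sketch, not something anyone ran in the channel: the function name `probe_unix_socket` and its return strings are made up for the example, and it assumes HHVM is configured to listen on a Unix domain socket at the path quoted in the conversation.

```python
import os
import socket
import stat


def probe_unix_socket(path, timeout=2.0):
    """Classify a Unix domain socket path (hypothetical helper).

    Returns "missing", "not-a-socket", "listening", or a string
    starting with "unreachable: " when the connect attempt fails.
    """
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"

    # A stale regular file at the socket path is a different failure
    # from a socket that nobody is listening on.
    if not stat.S_ISSOCK(mode):
        return "not-a-socket"

    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.settimeout(timeout)
    try:
        client.connect(path)
    except OSError as exc:
        # Covers refused connections, permission errors and timeouts.
        return "unreachable: " + (exc.strerror or exc.__class__.__name__)
    finally:
        client.close()
    return "listening"
```

Calling `probe_unix_socket("/var/run/hhvm/hhvm.sock")` separates "the file exists" from "something is listening", which `file` alone does not tell you. A socket left behind by a dead process still exists on disk but refuses connections, and a proxy in front of it would typically report that as a 502.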
[20:00:04] <victorbarbu_> if this is set by default, probably, [20:00:07] <victorbarbu_> I will check it out [20:01:15] <victorbarbu_> no "hhvm" in there [20:02:32] <grrrit-wm> (03CR) 10Dzahn: "i don't see that file "raid0-lvm-srv.cfg" (yet), just raid0-lvm.cfg. Are you adding that as a new file?" [puppet] - 10https://gerrit.wikimedia.org/r/262587 (https://phabricator.wikimedia.org/T121879) (owner: 10Papaul) [20:04:17] <papaul> dzahn: it is a new file [20:04:57] <mutante> victorbarbu_: ps aux | grep hhvm ? do you see it running ? [20:05:07] <mutante> papaul: is that a second change? [20:06:30] <grrrit-wm> (03CR) 10Dzahn: "this was originally about renaming classes with dash, amended by hashar since this class is not used at all. thanks" [puppet] - 10https://gerrit.wikimedia.org/r/260190 (owner: 10Dzahn) [20:06:39] <papaul> mutante: ?? [20:06:51] <victorbarbu_> mutante, https://dpaste.de/Ts38 [20:06:52] <hashar> :) [20:07:27] <mutante> papaul: are you still writing raid0-lvm-srv.cfg and it's a separate upload to gerrit? [20:07:34] <mutante> hashar: :) thx [20:07:39] <papaul> mutante: is it done [20:08:55] <mutante> victorbarbu_: i'm noticing this part there "--config /etc/hhvm/php.ini " .. php.ini. do you also have /etc/hhvm/fcgi.ini ? 
because in your original error messages it says fcgi [20:09:24] <mutante> on a random wmf server: hhvm --config /etc/hhvm/fcgi.ini --mode server [20:09:59] <mutante> papaul: i pulled to see if it got added and i could not find it [20:11:13] <wikibugs> 7Puppet, 10Deployment-Systems, 5Patch-For-Review, 3Scap3: Move scap.cfg things out of scap and into puppet - https://phabricator.wikimedia.org/T121435#1916666 (10greg) [20:11:24] <Robh> papaul: ohh [20:11:29] <Robh> you have to git add <filename> [20:11:31] <papaul> mutante: let me check that [20:11:34] <Robh> and then git commit --amend -a [20:11:42] <Robh> in your local repo [20:11:51] <Robh> i forgot to tell you the git add part, sorry =] [20:12:08] <victorbarbu_> nope [20:12:08] <Robh> oh [20:12:13] <Robh> papaul: you made two patchsets, nm [20:12:17] <Robh> forget what i just said [20:12:18] <victorbarbu_> there's not fcgi.ini [20:12:23] <Robh> papaul: you could have made that a single patchset =] [20:12:26] <mutante> that was the question, is it a second patchset or not [20:12:27] <mutante> either works [20:12:40] <Robh> mutante: indeed [20:12:40] <Robh> https://gerrit.wikimedia.org/r/#/c/262518/ [20:12:43] <Robh> so it works =] [20:13:00] <grrrit-wm> (03PS5) 10RobH: :Create raid0-lvm-srv.cfg partman file Bug:T121879: [puppet] - 10https://gerrit.wikimedia.org/r/262518 (owner: 10Papaul) [20:13:28] <mutante> cool, then merge that one first. 
i would have done the other but when the file exists [20:14:49] <grrrit-wm> (03CR) 10RobH: [C: 032] :Create raid0-lvm-srv.cfg partman file Bug:T121879: [puppet] - 10https://gerrit.wikimedia.org/r/262518 (owner: 10Papaul) [20:15:01] <Robh> mutante: merged, you can push the other [20:15:17] <grrrit-wm> (03PS2) 10Dzahn: Adding install params for pc200[4-6] Bug:T121879 [puppet] - 10https://gerrit.wikimedia.org/r/262587 (https://phabricator.wikimedia.org/T121879) (owner: 10Papaul) [20:15:21] <wikibugs> 6operations, 10Datasets-General-or-Unknown: Provide a good download service of dumps from Wikimedia - https://phabricator.wikimedia.org/T122917#1916695 (10ArielGlenn) 3NEW [20:15:25] <grrrit-wm> (03CR) 10Dzahn: [C: 032] Adding install params for pc200[4-6] Bug:T121879 [puppet] - 10https://gerrit.wikimedia.org/r/262587 (https://phabricator.wikimedia.org/T121879) (owner: 10Papaul) [20:15:41] <Robh> papaul: so now you have your first partman recipe \o/ [20:16:21] <victorbarbu_> myrcx, ping [20:16:28] <wikibugs> 6operations, 10Datasets-General-or-Unknown, 10netops: dumps.wikimedia.org seems to have poor networking towards Telia - https://phabricator.wikimedia.org/T120425#1916703 (10ArielGlenn) [20:16:30] <wikibugs> 6operations: bond eth interfaces on ms1001 - https://phabricator.wikimedia.org/T89829#1916704 (10ArielGlenn) [20:16:31] <papaul> Robh: mutante i stay trying to understand what happen [20:16:32] <wikibugs> 6operations, 10Datasets-General-or-Unknown: Sometimes (at peak usage?), dumps.wikimedia.org becomes very slow for users (sometimes unresponsive) - https://phabricator.wikimedia.org/T45647#1916705 (10ArielGlenn) [20:16:33] <wikibugs> 6operations, 10Datasets-General-or-Unknown: Provide a good download service of dumps from Wikimedia - https://phabricator.wikimedia.org/T122917#1916702 (10ArielGlenn) [20:16:46] <Robh> papaul: so it looks like you just had two patchsets into gerrit is all [20:16:51] <Robh> and we each expected a single patchset with all the changes 
[20:16:52] <wikibugs> 6operations, 10Datasets-General-or-Unknown: Provide a good download service of dumps from Wikimedia - https://phabricator.wikimedia.org/T122917#1916695 (10ArielGlenn) [20:16:55] <Robh> no big deal =] [20:18:23] <Robh> oh, let me make sure the network ports are done for those [20:20:02] <mutante> papaul: i just saw one of the 2 gerrit links so i thought "i dont wanna merge it until the .cfg file it refers to actually exists" that's all. so it was just about the order of merging things. or combining it in a single change with --amend [20:20:06] <Robh> (they arent doing them now) [20:20:33] <papaul> mutante: ok thanks [20:20:48] <papaul> Robh: i am working on linux-host-entries.ttyS1-115200 now [20:20:54] <mutante> papaul: try running puppet on carbon now [20:21:02] <Robh> i dont think he can force that can he? [20:21:03] <Robh> i think we have to [20:21:09] <mutante> i was about to ask that [20:21:19] <mutante> yea.. [20:21:28] <Robh> he has other patches to go too [20:21:31] <Robh> so no need to force yet [20:21:35] <mutante> ok [20:21:36] <Robh> (still need the dhcp update) [20:21:45] <papaul> Robh: yes working on that [20:21:53] <Robh> yep [20:26:35] <Robh> all the switch ports are setup [20:26:45] <wikibugs> 6operations, 10ops-codfw: rack/setup pc2004-2006 - https://phabricator.wikimedia.org/T121879#1916708 (10RobH) switch ports allocated and setup [20:26:49] <grrrit-wm> (03PS4) 10Madhuvishy: [WIP] wikimetrics: Puppet module for wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/260687 [20:27:43] <grrrit-wm> (03CR) 10jenkins-bot: [V: 04-1] [WIP] wikimetrics: Puppet module for wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/260687 (owner: 10Madhuvishy) [20:29:20] <grrrit-wm> (03PS3) 10Dzahn: add several parked domains [dns] - 10https://gerrit.wikimedia.org/r/260706 (https://phabricator.wikimedia.org/T121914) [20:30:10] <grrrit-wm> (03PS4) 10Dzahn: add several parked domains [dns] - 10https://gerrit.wikimedia.org/r/260706 
(https://phabricator.wikimedia.org/T121914) [20:30:35] <icinga-wm> PROBLEM - Host mw2131 is DOWN: PING CRITICAL - Packet loss = 100% [20:32:01] <grrrit-wm> (03CR) 10Dzahn: "none of these work before or will work after as in getting traffic. it's just about having valid zones at all." [dns] - 10https://gerrit.wikimedia.org/r/260706 (https://phabricator.wikimedia.org/T121914) (owner: 10Dzahn) [20:32:08] <grrrit-wm> (03CR) 10Dzahn: [C: 032] "none of these work before or will work after as in getting traffic. it's just about having valid zones at all." [dns] - 10https://gerrit.wikimedia.org/r/260706 (https://phabricator.wikimedia.org/T121914) (owner: 10Dzahn) [20:32:55] <grrrit-wm> (03PS5) 10Hashar: tox entry point to run pep8==1.4.6 [puppet] - 10https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) [20:32:57] <grrrit-wm> (03PS1) 10Hashar: base: fix missing whitespaces in check_conntrack.py [puppet] - 10https://gerrit.wikimedia.org/r/262593 [20:32:59] <grrrit-wm> (03PS1) 10Hashar: elasticsearch: lint check_elasticsearch.py [puppet] - 10https://gerrit.wikimedia.org/r/262594 [20:33:01] <grrrit-wm> (03PS1) 10Hashar: interface: lint interface-rps.py [puppet] - 10https://gerrit.wikimedia.org/r/262595 [20:33:03] <grrrit-wm> (03PS1) 10Hashar: toollabs: lint genpp.py [puppet] - 10https://gerrit.wikimedia.org/r/262596 [20:33:05] <grrrit-wm> (03PS1) 10Hashar: varnish: lint varnishlog.py [puppet] - 10https://gerrit.wikimedia.org/r/262597 [20:35:22] <grrrit-wm> (03PS1) 10Hashar: Get rid of .pep8 files [puppet] - 10https://gerrit.wikimedia.org/r/262598 (https://phabricator.wikimedia.org/T114887) [20:36:13] <grrrit-wm> (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar) [20:36:29] <grrrit-wm> (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/262598 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar) [20:40:34] 
<hashar> bunch of python linting changes ^^^^ :D [20:46:18] <icinga-wm> RECOVERY - Host mw2131 is UP: PING OK - Packet loss = 0%, RTA = 36.51 ms [20:47:38] <icinga-wm> PROBLEM - Host mw2132 is DOWN: PING CRITICAL - Packet loss = 100% [20:48:11] <greg-g> ?? ^ re the dallas mws [20:49:00] <greg-g> mutante: is something going on that would cause those flaps or ? [20:49:20] <Robh> it seems like its just a single mw host right? mw2131 [20:49:28] <icinga-wm> PROBLEM - Host mw2133 is DOWN: PING CRITICAL - Packet loss = 100% [20:49:28] <icinga-wm> RECOVERY - Host mw2132 is UP: PING OK - Packet loss = 0%, RTA = 36.50 ms [20:49:31] <greg-g> and 32 [20:49:35] <Robh> nope, two of them... [20:49:39] <mutante> greg-g: no, they are actually being rebooted [20:49:44] <greg-g> kk [20:49:44] <mutante> it seems somebody is doing upgrades [20:49:45] <Robh> oh, ok [20:49:46] * greg-g ignores [20:49:48] <mutante> but not me [20:49:50] <mutante> and i didnt know [20:50:02] <Robh> grr [20:50:08] <Robh> why isnt there a log from whoever is doing it? https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:36] <hashar> Dallas mw apps aren't serving any traffic anyway do they? [20:50:58] <Robh> if you arent going to put it into maint mode (so icinga doesnt flap) you should still log it [20:51:10] <Robh> otherwise folks start doing what we did, wondering whats up and possibly investigating [20:51:24] <mutante> +1 [20:51:28] <Robh> hashar: they arent to my knowledge [20:51:47] <Robh> but we have to treat all icinga errors as legitimate or it becomes useless [20:52:07] <mutante> exactly that [20:52:09] * Robh goes through the soapbox motions [20:53:39] <icinga-wm> RECOVERY - Host mw2133 is UP: PING OK - Packet loss = 0%, RTA = 36.78 ms [20:54:09] <icinga-wm> PROBLEM - puppet last run on db2061 is CRITICAL: CRITICAL: puppet fail [20:55:28] <hashar> mw2133 got rebooted [20:55:36] <mutante> by who?
[20:55:49] <hashar> no clue I have no root to look at log [20:55:55] <hashar> reboot system boot 3.13.0-24-generi Tue Jan 5 20:53 - 20:55 (00:02) [20:57:12] <hashar> maybe they are being installed ? [20:57:28] <Robh> it would be reinstalled then [20:57:38] <Robh> as they have icinga monitoring already (hence alarms, so they were already installed) [20:57:44] <Robh> and then should be in maint or have SAL entry ;] [20:57:56] <mutante> @seen moritzm [20:57:56] <wm-bot> mutante: I have never seen moritzm [21:00:45] <wikibugs> 6operations, 10ops-codfw, 10hardware-requests: wipe disks and add nembus back to server spares - https://phabricator.wikimedia.org/T122100#1916795 (10Papaul) [21:00:48] <Robh> well, last shows the system calling a reboot [21:01:00] <Robh> which could be due to any number of upgrades to kernel or other items... [21:01:12] <wikibugs> 6operations, 10hardware-requests: Decomission/reclaim nembus/neptunium - https://phabricator.wikimedia.org/T122050#1916798 (10Papaul) [21:01:17] <Robh> or salt.
[21:07:37] <mutante> i also thought kernel upgrade but the version is the same as on one that is running a long time [21:08:06] <mutante> and yea, must be salt [21:12:12] <Robh> yea its annoying as fuck that folks are not logging their shit [21:12:16] <Robh> =P [21:12:46] <Robh> In the past that would be grounds to get a bunch of angry opsen shaking fists ;] [21:19:14] <icinga-wm> RECOVERY - puppet last run on db2061 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [21:30:41] <grrrit-wm> (03PS4) 10Dzahn: ores: monitor workers without service reloads [puppet] - 10https://gerrit.wikimedia.org/r/262418 (https://phabricator.wikimedia.org/T122830) [21:31:27] <grrrit-wm> (03CR) 10Dzahn: [C: 032] ores: monitor workers without service reloads [puppet] - 10https://gerrit.wikimedia.org/r/262418 (https://phabricator.wikimedia.org/T122830) (owner: 10Dzahn) [21:42:16] <grrrit-wm> (03PS1) 10Papaul: Add pc200[4-6] MAC address entries Bug:T121879 [puppet] - 10https://gerrit.wikimedia.org/r/262669 (https://phabricator.wikimedia.org/T121879) [21:50:14] <icinga-wm> PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 60.87% of data above the critical threshold [5000000.0] [21:51:34] <icinga-wm> PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [21:54:45] <icinga-wm> PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [21:56:54] <icinga-wm> RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:58:04] <icinga-wm> RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:03:14] <icinga-wm> RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [22:05:01] <grrrit-wm> (03PS1) 10Chad: Use %{YEAR_DATE} instead of updating Wikimania 
redirects every year [puppet] - 10https://gerrit.wikimedia.org/r/262670 [22:05:19] <wikibugs> 10Ops-Access-Requests, 6operations: Access for new Analytics Opsen: Luca Toscano - https://phabricator.wikimedia.org/T122925#1916866 (10Ottomata) 3NEW [22:05:29] <ostriches> Heh, that was easy [22:05:37] <ostriches> %{YEAR_DATE} just got passed straight through [22:06:25] <wikibugs> 10Ops-Access-Requests, 6operations: Access for new Analytics Opsen: Luca Toscano - https://phabricator.wikimedia.org/T122925#1916885 (10Ottomata) As this is a new hire, do we need to wait 3 days? Since some of the groups do grant limited sudo access, it will be fine if we wait for an ops meeting before we add... [22:06:57] <grrrit-wm> (03PS2) 10Chad: Use %{YEAR_DATE} instead of updating Wikimania redirects every year [puppet] - 10https://gerrit.wikimedia.org/r/262670 [22:08:04] <icinga-wm> PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [22:15:04] <mutante> i'm fixing that ^ [22:15:11] <grrrit-wm> (03PS1) 10Dzahn: icinga/ores: add check command definition, use nagios_common [puppet] - 10https://gerrit.wikimedia.org/r/262673 [22:16:00] <grrrit-wm> (03CR) 10Alex Monk: "Ping" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/248634 (owner: 10Alex Monk) [22:17:57] <grrrit-wm> (03PS1) 10Andrew Bogott: Don't use an http proxy for vmbuilder. [puppet] - 10https://gerrit.wikimedia.org/r/262674 [22:20:31] <grrrit-wm> (03CR) 10Andrew Bogott: [C: 032] Don't use an http proxy for vmbuilder. [puppet] - 10https://gerrit.wikimedia.org/r/262674 (owner: 10Andrew Bogott) [22:23:01] <wikibugs> 6operations, 6Labs, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T85913#1916956 (10Aklapper) #operations: If nobody feels comfortable going ahead and rename because nobody understands all the potential negative side effects, w... 
[22:25:44] <grrrit-wm> (03PS2) 10Dzahn: icinga/ores: add check command definition, use nagios_common [puppet] - 10https://gerrit.wikimedia.org/r/262673 (https://phabricator.wikimedia.org/T122830) [22:25:58] <grrrit-wm> (03PS3) 10Dzahn: icinga/ores: add check command definition, use nagios_common [puppet] - 10https://gerrit.wikimedia.org/r/262673 (https://phabricator.wikimedia.org/T122830) [22:26:25] <grrrit-wm> (03CR) 10Dzahn: [C: 032] icinga/ores: add check command definition, use nagios_common [puppet] - 10https://gerrit.wikimedia.org/r/262673 (https://phabricator.wikimedia.org/T122830) (owner: 10Dzahn) [22:35:26] <grrrit-wm> (03CR) 10Aaron Schulz: [C: 032] Remove redundant RunJobs code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261304 (owner: 10Aaron Schulz) [22:36:07] <grrrit-wm> (03Merged) 10jenkins-bot: Remove redundant RunJobs code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261304 (owner: 10Aaron Schulz) [22:36:09] <grrrit-wm> (03PS5) 10Alex Monk: Do not rewrite https -> http for IRC notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) (owner: 10Faidon Liambotis) [22:36:17] <grrrit-wm> (03PS6) 10Alex Monk: Do not rewrite https -> http for IRC notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) (owner: 10Faidon Liambotis) [22:36:25] <grrrit-wm> (03CR) 10jenkins-bot: [V: 04-1] Do not rewrite https -> http for IRC notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) (owner: 10Faidon Liambotis) [22:36:28] <paravoid> Krenair: are you merging that? 
[22:36:35] <paravoid> there was significant pushback on it [22:36:45] <Krenair> no [22:37:17] <Krenair> not now anyway [22:37:21] <James_F> *yet [22:38:01] <Krenair> I made a task [22:38:04] <Krenair> to get it announced [22:38:06] <logmsgbot> !log aaron@tin Synchronized rpc: 830e1ed8d80295710dc02f18102b4fadae7fca86 (duration: 00m 55s) [22:38:10] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:38:36] <icinga-wm> PROBLEM - puppet last run on mw1006 is CRITICAL: CRITICAL: Puppet has 2 failures [22:41:00] <Krenair> AaronSchulz, I thought we weren't doing such deploys this week? [22:41:24] <icinga-wm> RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [22:42:17] <grrrit-wm> (03CR) 10Alex Monk: [C: 04-1] "Needs announcement, feel free to remove my CR-1 when that's done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208655 (https://phabricator.wikimedia.org/T94416) (owner: 10Aude) [22:42:42] <grrrit-wm> (03CR) 10Alex Monk: [C: 04-1] "waiting for tech news" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) (owner: 10Faidon Liambotis) [22:43:45] <paravoid> Krenair: oh, great, thanks :) [22:44:47] <paravoid> there was also some other change that required bots to be adjusted, I think hardcoded "pmtpa" or something? [22:45:06] <paravoid> if people are going to adjust their regexps, might just as well tell them to do both? [22:47:46] <Krenair> rc-pmtpa is the nickname used by the feed [22:48:26] <Krenair> that requires a puppet change [22:48:47] <Krenair> and would be applied separately to the https change and at a slightly different time etc.
[22:49:35] <grrrit-wm> (03Abandoned) 10Aaron Schulz: Use rdb1005 as the primary job queue aggregator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258714 (owner: 10Ori.livneh) [22:49:41] <mutante> changing that nickname broke user tools [22:52:11] <Platonides> well, if we know which will be the new nickname, both changes could be set at the same time [22:52:23] <Platonides> I'm not convinced it needs to be done, though [22:53:00] <mutante> if we do that let's also add IPv6 [22:53:10] <mutante> it already has the mapped address on interface [22:53:19] <mutante> just the ircd isnt listening on it because we didnt want to restart [22:53:27] <mutante> and then it closes a 3rd open ticket [22:53:50] <grrrit-wm> (03CR) 10Alex Monk: "Do we have a schedule for getting this done?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251678 (https://phabricator.wikimedia.org/T119100) (owner: 10CSteipp) [22:54:40] <Platonides> btw does it support ssl? [22:54:56] <Krenair> and also https://phabricator.wikimedia.org/T87780 [22:56:18] <mutante> Platonides: i dont think so, at least not on 6697 [22:56:49] <mutante> Krenair: well, the new stream can't replace the old stuff [22:56:55] <mutante> is what we were told [22:56:59] <Platonides> another thing that could be added when restarting :) [22:57:31] <mutante> i dont think it's technically deprecated [22:57:39] <mutante> been there [22:59:08] <mutante> Platonides: that will also need a puppet change to add ferm rules [22:59:17] <mutante> not just ircd config [23:02:10] <Platonides> nothing that can't be done ;) [23:02:46] <grrrit-wm> (03CR) 10Alex Monk: [C: 031] Use %{YEAR_DATE} instead of updating Wikimania redirects every year [puppet] - 10https://gerrit.wikimedia.org/r/262670 (owner: 10Chad) [23:03:52] <grrrit-wm> (03CR) 10Alex Monk: "(I can use a mouse)" [puppet] - 10https://gerrit.wikimedia.org/r/262670 (owner: 10Chad) [23:04:15] <icinga-wm> RECOVERY - puppet last run on mw1006 is OK: OK: Puppet is currently enabled, 
last run 30 seconds ago with 0 failures [23:06:08] <wikibugs> 10Ops-Access-Requests, 6operations: Access for new Analytics Opsen: Luca Toscano - https://phabricator.wikimedia.org/T122925#1917062 (10Krenair) bastiononly should no longer be necessary. eventlogging-admins and researchers are also non-sudo. @Ottomata: I don't think new foundation hires get an exception to... [23:09:41] <wikibugs> 10Ops-Access-Requests, 6operations: Access for new Analytics Opsen: Luca Toscano - https://phabricator.wikimedia.org/T122925#1917072 (10RobH) No one gets an exception to the 3 day wait, unless @mark specifically waives it as the head of ops. [23:28:31] <grrrit-wm> (03PS1) 10Dzahn: icinga/ores: put homemade plugins into /usr/local/ [puppet] - 10https://gerrit.wikimedia.org/r/262677 [23:29:47] <grrrit-wm> (03CR) 10jenkins-bot: [V: 04-1] icinga/ores: put homemade plugins into /usr/local/ [puppet] - 10https://gerrit.wikimedia.org/r/262677 (owner: 10Dzahn) [23:48:07] <grrrit-wm> (03PS2) 10Dzahn: icinga/ores: put homemade plugins into /usr/local/ [puppet] - 10https://gerrit.wikimedia.org/r/262677 [23:49:04] <grrrit-wm> (03CR) 10jenkins-bot: [V: 04-1] icinga/ores: put homemade plugins into /usr/local/ [puppet] - 10https://gerrit.wikimedia.org/r/262677 (owner: 10Dzahn) [23:50:10] <grrrit-wm> (03CR) 10Hoo man: "Not sure how localization updates are working for this. Before deploying this, we probably want to make sure that at least the most of the" [dumps/dcat] - 10https://gerrit.wikimedia.org/r/262422 (https://phabricator.wikimedia.org/T118397) (owner: 10Lokal Profil) [23:52:37] <grrrit-wm> (03CR) 10CSteipp: "Ah sorry. Waiting on the meta RFC [https://meta.wikimedia.org/wiki/Requests_for_comment/Password_policy_for_users_with_certain_advanced_pe" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251678 (https://phabricator.wikimedia.org/T119100) (owner: 10CSteipp)