[00:58:36] <icinga-wm_>	 PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: puppet fail
[01:26:17] <icinga-wm_>	 RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:08:26] <icinga-wm_>	 PROBLEM - puppet last run on lvs4004 is CRITICAL: CRITICAL: puppet fail
[02:21:42] <logmsgbot>	 !log l10nupdate@tin Synchronized php-1.27.0-wmf.5/cache/l10n: l10nupdate for 1.27.0-wmf.5 (duration: 06m 49s)
[02:21:49] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:36:08] <icinga-wm_>	 RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[05:18:37] <grrrit-wm>	 (03PS3) 10MaxSem: Switch www.wikimedia.org to source control [puppet] - 10https://gerrit.wikimedia.org/r/249009 (https://phabricator.wikimedia.org/T115964) 
[05:21:38] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[05:31:06] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 8 below the confidence bounds
[05:42:56] <legoktm>	 slave lag?
[05:42:57] <legoktm>	 Warning: The database has been locked for maintenance, so you will not be able to save your edits right now. You may wish to copy and paste your text into a text file and save it for later.
[05:42:57] <legoktm>	 The administrator who locked it offered this explanation: The database has been automatically locked while the slave database servers catch up to the master.
[05:43:20] <legoktm>	 it's fine now
[05:49:27] <icinga-wm_>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[06:30:37] <icinga-wm_>	 PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:40] <cajoel>	 anyone toyed with prometheus.io?
[06:30:46] <icinga-wm_>	 PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: puppet fail
[06:31:17] <icinga-wm_>	 PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:26] <icinga-wm_>	 PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:47] <icinga-wm_>	 PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:18] <icinga-wm_>	 PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:27] <icinga-wm_>	 RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:36] <icinga-wm_>	 RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:57:16] <icinga-wm_>	 RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:17] <icinga-wm_>	 RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:46] <icinga-wm_>	 RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:58:08] <icinga-wm_>	 RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[07:28:25] <SMalyshev>	 anybody know what's going on with git deploy/git fat on tin? I try to deploy and it takes a very long time and the deployment does not work at the end
[07:28:43] <SMalyshev>	 git-fat file is not resolved and I get 0/2 minions completed fetch too
[07:40:37] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 18 data above and 8 below the confidence bounds
[07:51:38] <icinga-wm_>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[08:00:35] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 031] "Looks good to me. @JanZerebecki ; that's not a problem as long as the identifiers of the ferm rules (like librenms-http) are unique. In fac" [puppet] - 10https://gerrit.wikimedia.org/r/251550 (https://phabricator.wikimedia.org/T105410) (owner: 10Dzahn)
[08:10:26] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 2 below the confidence bounds
[08:13:59] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Point git buildpackage upstream to HEAD of repo [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/251506 (owner: 10Hashar)
[08:15:56] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 2 below the confidence bounds
[08:36:42] <wikibugs>	 6operations, 6Release-Engineering-Team: deployment broken on wdqs1001 - https://phabricator.wikimedia.org/T118148#1792656 (10Smalyshev) 3NEW
[08:38:17] <icinga-wm_>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[08:59:15] <grrrit-wm>	 (03CR) 10TTO: "@Krenair: Please see comment on createTxtFileSymlinks" (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO)
[08:59:22] <grrrit-wm>	 (03PS9) 10TTO: Allow import from any Labs/Beta Cluster project to any other [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) 
[08:59:54] <tto>	 ori: around?
[09:10:14] <dcausse>	 !log freezing elasticsearch indices in eqiad (test)
[09:10:17] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:18:45] <dcausse>	 !log restarting elastic on elastic1008.eqiad.wmnet (test)
[09:18:49] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:19:17] <dcausse>	 !log err (1007 no 1008): restarting elastic on elastic1007.eqiad.wmnet (test)
[09:19:20] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:29:40] <godog>	 !log swift codfw-prod: ms-be2016 / ms-be2018 / ms-be2020 weight 3000
[09:29:44] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:41:23] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] .gitreview file [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/251505 (owner: 10Hashar)
[09:48:47] <wikibugs>	 6operations, 10Datasets-General-or-Unknown, 7HHVM, 5Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1792716 (10ArielGlenn)
[09:50:05] <wikibugs>	 6operations, 10Dumps-Generation: make dumps easy to rerun or clean up - https://phabricator.wikimedia.org/T110876#1792720 (10ArielGlenn)
[09:55:05] <wikibugs>	 6operations, 10Dumps-Generation: make dumps easy to rerun or clean up - https://phabricator.wikimedia.org/T110876#1792726 (10ArielGlenn) That history rerun task needs to be updated and it doesn't block this work after all; we already rerun history jobs now automatically, as for example the previous month's dum...
[09:55:32] <wikibugs>	 6operations, 10Dumps-Generation: make dumps easy to rerun or clean up - https://phabricator.wikimedia.org/T110876#1792727 (10ArielGlenn) 5Open>3Resolved Ah and with that commit this task is now complete.
[09:57:34] <wikibugs>	 6operations: Puppet Compiler:  Support wildcards, regexps, or 'all hosts' - https://phabricator.wikimedia.org/T114305#1792731 (10Joe) p:5Triage>3Low
[09:57:50] <wikibugs>	 6operations, 10Beta-Cluster-Infrastructure, 7Blocked-on-RelEng, 7HHVM, 5Patch-For-Review: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1792734 (10Joe)
[09:57:52] <wikibugs>	 7Blocked-on-Operations, 6operations, 7HHVM, 5Patch-For-Review: Reimage mw1152 as a terbium replacement - https://phabricator.wikimedia.org/T116728#1792733 (10Joe) 5Open>3Resolved
[10:02:02] <_joe_>	 !log repooled mw1061
[10:02:06] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:02:15] <wikibugs>	 6operations, 10ops-eqiad, 5Patch-For-Review: mw1061 has a faulty disk, filesystem is read-only - https://phabricator.wikimedia.org/T107849#1792750 (10Joe) 5Open>3Resolved
[10:02:16] <wikibugs>	 6operations: Investigate idle/depooled eqiad appservers - https://phabricator.wikimedia.org/T116256#1792751 (10Joe)
[10:04:13] <grrrit-wm>	 (03PS5) 10Filippo Giunchedi: restbase: move to systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) 
[10:06:31] <grrrit-wm>	 (03PS1) 10ArielGlenn: dumps: more refactoring of classes [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/251931 
[10:06:32] <grrrit-wm>	 (03PS1) 10ArielGlenn: dumps: split the huge jobs module into several manageable ones [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/251932 
[10:13:16] <grrrit-wm>	 (03PS1) 10Filippo Giunchedi: base: further clarify service_unit ensure [puppet] - 10https://gerrit.wikimedia.org/r/251933 
[10:14:11] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: restbase: move to systemd unit file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) (owner: 10Filippo Giunchedi)
[10:15:55] <grrrit-wm>	 (03PS4) 10ArielGlenn: dumps: move more classes into library, refactor link/feed/etc handling [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/250921 
[10:22:28] <wikibugs>	 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1792780 (10ArielGlenn) @mark?
[10:24:31] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: move more classes into library, refactor link/feed/etc handling [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/250921 (owner: 10ArielGlenn)
[10:24:47] <grrrit-wm>	 (03PS2) 10ArielGlenn: dumps: more refactoring of classes [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/251931 
[10:33:31] <grrrit-wm>	 (03CR) 10Muehlenhoff: "The error in puppet compiler is caused by a merge conflict, since this patch depends on the now abandoned" [puppet] - 10https://gerrit.wikimedia.org/r/250078 (owner: 10Dzahn)
[10:34:59] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: more refactoring of classes [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/251931 (owner: 10ArielGlenn)
[10:35:11] <grrrit-wm>	 (03PS2) 10ArielGlenn: dumps: split the huge jobs module into several manageable ones [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/251932 
[10:36:01] <godog>	 mobrovac: heads up re: cassandra multiple instances in production, ping me when you are about
[10:53:07] <dcausse>	 !log resuming writes to elasticsearch indices in eqiad (test)
[10:53:10] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:54:29] <grrrit-wm>	 (03PS1) 10Hashar: Jenkins: sync cli-shutdown.groovy from upstream [puppet] - 10https://gerrit.wikimedia.org/r/251935 (https://phabricator.wikimedia.org/T118064) 
[11:00:31] <wikibugs>	 6operations, 10Gitblit: Accessing raw link on git.wikimedia.org causes "Error Sorry, the repository mediawiki does not have a extensions branch!" - https://phabricator.wikimedia.org/T118156#1792867 (10saper) 3NEW
[11:01:51] <wikibugs>	 6operations, 10Gitblit: Accessing raw link on git.wikimedia.org causes "Error Sorry, the repository mediawiki does not have a extensions branch!" - https://phabricator.wikimedia.org/T118156#1792880 (10saper) http://git.wikimedia.org/raw/mediawiki%2Fextensions%2FSemanticMediaWiki.git/master/COPYING  has the sam...
[11:09:20] <hashar>	 !log Upgrading Jenkins from LTS 1.609.3 to LTS 1.625.1 https://phabricator.wikimedia.org/T118157
[11:09:24] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:14:08] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Enable ferm on kafka1014 [puppet] - 10https://gerrit.wikimedia.org/r/251936 
[11:16:00] <hashar>	 !log Jenkins back and apparently happy
[11:16:04] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:17:30] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Enable ferm on kafka1014 [puppet] - 10https://gerrit.wikimedia.org/r/251936 
[11:18:07] <wikibugs>	 6operations, 10Continuous-Integration-Infrastructure, 7Jenkins, 7WorkType-Maintenance: Please refresh Jenkins package on apt.wikimedia.org to 1.625.1 - https://phabricator.wikimedia.org/T118158#1792900 (10hashar) 3NEW
[11:23:29] <wikibugs>	 6operations, 7Database: dbtree fails to render correctly on a new server (mw1152) both with zend php and hhvm - https://phabricator.wikimedia.org/T118159#1792947 (10Joe) 3NEW
[11:23:48] <_joe_>	 jynus: I marked this task "Database" ^^ but feel free to strip the label
[11:26:58] <jynus>	 _joe_, that is an answer from memcached
[11:28:08] <jynus>	 oh, no sorry
[11:28:13] <jynus>	 I misread the line
[11:28:23] <jynus>	 it is mysql_fetch_array
[11:29:40] <_joe_>	 yup
[11:36:13] <grrrit-wm>	 (03PS7) 10Muehlenhoff: Create a define to register extra LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) 
[11:37:07] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Create a define to register extra LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff)
[11:39:22] <wikibugs>	 6operations, 10Deployment-Systems, 6Release-Engineering-Team: deployment broken on wdqs1001 - https://phabricator.wikimedia.org/T118148#1793022 (10hashar)
[11:42:21] <grrrit-wm>	 (03PS5) 10Muehlenhoff: openldap: Allow configurable ACLs [puppet] - 10https://gerrit.wikimedia.org/r/251272 (https://phabricator.wikimedia.org/T101299) 
[11:44:42] <grrrit-wm>	 (03PS1) 10Jcrespo: Depooling again db1060 for more maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251939 
[11:44:58] <grrrit-wm>	 (03PS2) 10Jcrespo: Depooling again db1060 for more maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251939 
[11:51:37] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Depooling again db1060 for more maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251939 (owner: 10Jcrespo)
[11:51:59] <grrrit-wm>	 (03Merged) 10jenkins-bot: Depooling again db1060 for more maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251939 (owner: 10Jcrespo)
[12:00:11] <mobrovac>	 godog: pong
[12:01:25] <godog>	 mobrovac: ack! looks like we're good to go? I'll prepare the code reviews
[12:02:18] <mobrovac>	 godog: let me check something first
[12:02:36] <godog>	 mobrovac: yeah it'll take some time anyway
[12:02:57] <grrrit-wm>	 (03CR) 10Muehlenhoff: "The debdeploy/Hiera part of the change works fine, but puppet compiler also shows changes to the sshd config?" [puppet] - 10https://gerrit.wikimedia.org/r/250659 (owner: 10Muehlenhoff)
[12:10:41] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depooling db1060 again (duration: 01m 13s)
[12:10:44] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:14:41] <wikibugs>	 6operations, 10Wikibase-Quality, 10Wikidata, 7Database: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1793064 (10Joe) 3NEW
[12:14:45] <_joe_>	 aude: ^^
[12:14:56] <_joe_>	 please add tags that I might have missed
[12:15:15] <wikibugs>	 6operations, 10Gitblit: Accessing raw link on git.wikimedia.org causes "Error Sorry, the repository mediawiki does not have a extensions branch!" - https://phabricator.wikimedia.org/T118156#1793074 (10Aklapper) Same as T117459 ? Not sure why "operations" was added to this task?
[12:23:03] <grrrit-wm>	 (03PS1) 10Filippo Giunchedi: cassandra: additional instances for eqiad/codfw production [dns] - 10https://gerrit.wikimedia.org/r/251941 (https://phabricator.wikimedia.org/T95250) 
[12:23:14] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Reorg server groups for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/250932 
[12:25:09] <wikibugs>	 6operations, 7Database: dbtree fails to render correctly on a new server (mw1152) both with zend php and hhvm - https://phabricator.wikimedia.org/T118159#1793083 (10Krenair) Did you fix this? It looks like it's working to me.
[12:25:24] <_joe_>	 Krenair: yes we fixed it
[12:25:36] <_joe_>	 but jaime did, and he's performing an interview
[12:25:58] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: cassandra: additional instances for eqiad/codfw production [dns] - 10https://gerrit.wikimedia.org/r/251941 (https://phabricator.wikimedia.org/T95250) 
[12:26:04] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: additional instances for eqiad/codfw production [dns] - 10https://gerrit.wikimedia.org/r/251941 (https://phabricator.wikimedia.org/T95250) (owner: 10Filippo Giunchedi)
[12:27:11] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Reorg server groups for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/250932 (owner: 10Muehlenhoff)
[12:28:00] <Krenair>	 ok
[12:28:41] <mobrovac>	 godog: need to do a deploy before you decommission one server in codfw so that DTCS doesn't get reverted when restbase boots there
[12:29:23] <_joe_>	 every time I read these things ^^, it's mildly terrifying, you know that?
[12:30:06] <mobrovac>	 i know _joe_
[12:32:40] <mobrovac>	 godog: wait, you want to add 3x instances in both eqiad and codfw at the same time or am i reading https://gerrit.wikimedia.org/r/251941 wrong?
[12:33:25] <godog>	 mobrovac: nah that's just dns
[12:33:33] <mobrovac>	 k
[12:35:06] <icinga-wm_>	 PROBLEM - puppet last run on elastic1014 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:39:00] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: base: further clarify service_unit ensure [puppet] - 10https://gerrit.wikimedia.org/r/251933 
[12:39:07] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] base: further clarify service_unit ensure [puppet] - 10https://gerrit.wikimedia.org/r/251933 (owner: 10Filippo Giunchedi)
[12:39:52] <wikibugs>	 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1793130 (10Addshore)
[12:42:10] <wikibugs>	 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1793142 (10Addshore) Related:    - T111353   - T108944   - T48643   - T70381   - T70382
[12:48:30] <wikibugs>	 6operations, 10ops-esams: Power cr2-esams PEM 2/PEM 3 - https://phabricator.wikimedia.org/T118166#1793163 (10faidon) 3NEW a:3mark
[12:52:40] <grrrit-wm>	 (03PS2) 10Alexandros Kosiaris: Jenkins: sync cli-shutdown.groovy from upstream [puppet] - 10https://gerrit.wikimedia.org/r/251935 (https://phabricator.wikimedia.org/T118064) (owner: 10Hashar)
[12:52:48] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Jenkins: sync cli-shutdown.groovy from upstream [puppet] - 10https://gerrit.wikimedia.org/r/251935 (https://phabricator.wikimedia.org/T118064) (owner: 10Hashar)
[12:53:21] <hashar>	 akosiaris: thanks! will run puppet and get jenkins restarted :-}
[12:53:56] <akosiaris>	 hashar: yw
[12:54:25] <grrrit-wm>	 (03CR) 10Gilles: [C: 04-1] swift: monitor mediawiki originals upload rate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/251526 (https://phabricator.wikimedia.org/T92322) (owner: 10Filippo Giunchedi)
[12:55:47] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 031] openldap: Allow configurable ACLs [puppet] - 10https://gerrit.wikimedia.org/r/251272 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff)
[12:55:57] <icinga-wm_>	 PROBLEM - puppet last run on mw1132 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:56:38] <hashar>	 !log restarting Jenkins to refresh the cli-shutdown.groovy script -- https://gerrit.wikimedia.org/r/251935 (https://phabricator.wikimedia.org/T118064)
[12:56:42] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:57:58] <grrrit-wm>	 (03PS1) 10Muehlenhoff: More tweaks for server groups [puppet] - 10https://gerrit.wikimedia.org/r/251943 
[12:58:13] <grrrit-wm>	 (03PS2) 10Muehlenhoff: More tweaks for server groups [puppet] - 10https://gerrit.wikimedia.org/r/251943 
[12:59:06] <mobrovac>	 !log restbase start deploying ae2a44f
[12:59:09] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:59:32] <grrrit-wm>	 (03PS3) 10Alexandros Kosiaris: exim: Add and use $::other_site to provide LDAP fallback [puppet] - 10https://gerrit.wikimedia.org/r/249868 (https://phabricator.wikimedia.org/T82662) 
[12:59:34] <grrrit-wm>	 (03PS2) 10Alexandros Kosiaris: exim: removal of non-DC aware ldap-mirror CNAME [puppet] - 10https://gerrit.wikimedia.org/r/250438 
[12:59:55] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] More tweaks for server groups [puppet] - 10https://gerrit.wikimedia.org/r/251943 (owner: 10Muehlenhoff)
[13:01:27] <wikibugs>	 6operations, 10Beta-Cluster-Infrastructure: Can't apply ::role::logging::mediawiki on a trusty host - https://phabricator.wikimedia.org/T98627#1793203 (10hashar) 5Open>3Invalid a:3hashar deployment-fluorine has been rebuilt as a Precise host. No point in keeping this task around, whenever one migrates it...
[13:01:36] <icinga-wm_>	 RECOVERY - puppet last run on elastic1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:06:38] <grrrit-wm>	 (03PS1) 10Filippo Giunchedi: install_server: cassandra multi instance in eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/251944 
[13:06:40] <grrrit-wm>	 (03PS1) 10Filippo Giunchedi: cassandra: add restbase[12]00[12] to seeds [puppet] - 10https://gerrit.wikimedia.org/r/251945 
[13:08:07] <icinga-wm_>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [5000000.0]
[13:10:37] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Fix grain name [puppet] - 10https://gerrit.wikimedia.org/r/251946 
[13:10:47] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: install_server: cassandra multi instance in eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/251944 
[13:10:56] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] install_server: cassandra multi instance in eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/251944 (owner: 10Filippo Giunchedi)
[13:11:15] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: cassandra: add restbase[12]00[12] to seeds [puppet] - 10https://gerrit.wikimedia.org/r/251945 
[13:11:23] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase[12]00[12] to seeds [puppet] - 10https://gerrit.wikimedia.org/r/251945 (owner: 10Filippo Giunchedi)
[13:11:33] <wikibugs>	 6operations, 10Beta-Cluster-Infrastructure, 7Performance: Need a way to simulate replication lag to test replag issues - https://phabricator.wikimedia.org/T40945#1793233 (10hashar) 5Open>3stalled a:5Nikerabbit>3None
[13:13:24] <mobrovac>	 !log restbase finished deploying ae2a44f
[13:13:27] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:13:47] <mobrovac>	 godog: deploy done
[13:14:40] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Fix grain name [puppet] - 10https://gerrit.wikimedia.org/r/251946 
[13:14:57] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Fix grain name [puppet] - 10https://gerrit.wikimedia.org/r/251946 (owner: 10Muehlenhoff)
[13:15:53] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: split the huge jobs module into several manageable ones [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/251932 (owner: 10ArielGlenn)
[13:16:26] <godog>	 mobrovac: ack! I'll decommission restbase2001
[13:16:48] <godog>	 !log nodetool decommission restbase2001
[13:16:52] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:21:17] <icinga-wm_>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[13:22:26] <icinga-wm_>	 RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[13:23:49] <godog>	 mobrovac: it is decommissioning, I'll go for lunch meanwhile
[13:25:16] <paravoid>	 moritzm: merging your change
[13:27:55] <moritzm>	 thanks, sorry
[13:28:17] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 031] "LGTM for whenever you feel it's time to merge." [puppet] - 10https://gerrit.wikimedia.org/r/250438 (owner: 10Alexandros Kosiaris)
[13:28:34] <paravoid>	 np
[13:32:57] <icinga-wm_>	 RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.066 second response time
[13:32:57] <icinga-wm_>	 RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 70590 bytes in 0.314 second response time
[13:38:58] <grrrit-wm>	 (03CR) 10BBlack: [C: 04-1] "Well IE8/XP wouldn't be the only case, there are those other minority clients that we're unlikely to ever see on misc, or would see extrem" [puppet] - 10https://gerrit.wikimedia.org/r/251704 (owner: 10BBlack)
[13:39:34] <grrrit-wm>	 (03Abandoned) 10BBlack: set cache_misc to "mid" ciphersuite [puppet] - 10https://gerrit.wikimedia.org/r/251704 (owner: 10BBlack)
[13:39:57] <icinga-wm_>	 PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:41:21] <wikibugs>	 6operations, 5Continuous-Integration-Scaling, 7Nodepool: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#1793341 (10hashar) a:3hashar
[13:49:57] <wikibugs>	 6operations, 10hardware-requests: Detail codfw snapshot/dataset requirements - https://phabricator.wikimedia.org/T118173#1793344 (10RobH) 3NEW a:3RobH
[13:51:55] <wikibugs>	 6operations, 10hardware-requests: Detail codfw snapshot/dataset requirements - https://phabricator.wikimedia.org/T118173#1793353 (10RobH) Below is a copy from my old email entry in the RT ticket:  We need to replace the snapshot and dataset infrastructure from Tampa for CODFW.  All the hardware was out of warr...
[13:53:16] <icinga-wm_>	 RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[13:53:50] <wikibugs>	 6operations, 10hardware-requests: Detail codfw snapshot/dataset requirements - https://phabricator.wikimedia.org/T118173#1793354 (10RobH) a:5RobH>3ArielGlenn Assigning to Ariel for overall input and detailing of the requirements for a snapshot/dataset cluster host in codfw.  Keep in mind we'll have to orde...
[14:04:48] <icinga-wm_>	 RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:05:29] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: Kill CirrusSearch-slow-queries alert [puppet] - 10https://gerrit.wikimedia.org/r/251948 (https://phabricator.wikimedia.org/T84163) 
[14:05:50] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] Kill CirrusSearch-slow-queries alert [puppet] - 10https://gerrit.wikimedia.org/r/251948 (https://phabricator.wikimedia.org/T84163) (owner: 10Faidon Liambotis)
[14:06:43] <grrrit-wm>	 (03PS6) 10Muehlenhoff: openldap: Allow configurable ACLs [puppet] - 10https://gerrit.wikimedia.org/r/251272 (https://phabricator.wikimedia.org/T101299) 
[14:09:17] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Assign salt grains for cp* hosts [puppet] - 10https://gerrit.wikimedia.org/r/251949 
[14:10:17] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: dbtree: move to its own directory [puppet] - 10https://gerrit.wikimedia.org/r/251950 
[14:11:20] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: dbtree: move to its own directory [puppet] - 10https://gerrit.wikimedia.org/r/251950 
[14:14:02] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] dbtree: move to its own directory [puppet] - 10https://gerrit.wikimedia.org/r/251950 (owner: 10Giuseppe Lavagetto)
[14:15:02] <grrrit-wm>	 (03PS1) 10Muehlenhoff: zookeeper: Don't expose the JMX port in ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/251951 
[14:15:22] <grrrit-wm>	 (03CR) 10Ottomata: [C: 031] Enable ferm on kafka1014 [puppet] - 10https://gerrit.wikimedia.org/r/251936 (owner: 10Muehlenhoff)
[14:15:29] <ottomata>	 moritzm:  let's do it!
[14:17:24] <moritzm>	 ottomata: ok, I'll rebase, merge and puppet-run it
[14:17:52] <wikibugs>	 6operations, 6Phabricator: migrate RT main-announce into phabricator - https://phabricator.wikimedia.org/T118176#1793387 (10RobH) 3NEW
[14:17:58] <grrrit-wm>	 (03PS3) 10Muehlenhoff: Enable ferm on kafka1014 [puppet] - 10https://gerrit.wikimedia.org/r/251936 
[14:18:12] <wikibugs>	 6operations, 6Phabricator: migrate RT main-announce into phabricator - https://phabricator.wikimedia.org/T118176#1793394 (10RobH) a:3RobH I'll detail out how mail is routed and how we triage the requests shortly.
[14:18:49] <grrrit-wm>	 (03PS2) 10Andrew Bogott: Assign IPs for public labtest hosts. [dns] - 10https://gerrit.wikimedia.org/r/251655 
[14:19:31] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on kafka1014 [puppet] - 10https://gerrit.wikimedia.org/r/251936 (owner: 10Muehlenhoff)
[14:19:33] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Assign IPs for public labtest hosts. [dns] - 10https://gerrit.wikimedia.org/r/251655 (owner: 10Andrew Bogott)
[14:20:01] <ottomata>	 moritzm: i'm watching kafka logs on a couple of brokers
[14:20:05] <ottomata>	 dooo it
[14:20:06] <ottomata>	 :)
[14:20:22] <moritzm>	 puppet run is ongoing, will add some logging rules once done
[14:20:29] <wikibugs>	 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1793396 (10mark) >>! In T115288#1754857, @RobH wrote: > Chatted with Ariel in IRC. >  > Going to go with one of the: >  > Dell PowerEdge R420, Dual Intel Xeon E5-2440, 32GB Memory, Dual 300G...
[14:20:30] <grrrit-wm>	 (03CR) 10Ottomata: [C: 031] zookeeper: Don't expose the JMX port in ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/251951 (owner: 10Muehlenhoff)
[14:21:28] <icinga-wm_>	 RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:22:36] <moritzm>	 ottomata: up and enabled, nothing in the logs so far
[14:22:57] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: terbium: do not include noc [puppet] - 10https://gerrit.wikimedia.org/r/251952 
[14:23:18] <wikibugs>	 6operations, 10ops-eqiad: remove patch cable for cr1-eqiad:xe-4/2/1 ID 3482, circuit ID ETYX/084858//ZYO - https://phabricator.wikimedia.org/T118177#1793400 (10RobH) 3NEW a:3Cmjohnson
[14:23:37] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[14:24:25] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Nova scheduler changes: [puppet] - 10https://gerrit.wikimedia.org/r/251954 
[14:25:07] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: Remove subnet for ulsfo-eqiad Giglinx link [dns] - 10https://gerrit.wikimedia.org/r/251955 (https://phabricator.wikimedia.org/T118170) 
[14:25:21] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] terbium: do not include noc [puppet] - 10https://gerrit.wikimedia.org/r/251952 (owner: 10Giuseppe Lavagetto)
[14:27:06] <ottomata>	 hmm, i see more broken pipe errors on ka14 than i would expect, i think...
[14:27:18] <ottomata>	 yeah not looking good moritzm
[14:27:23] <ottomata>	 hold
[14:28:19] <moritzm>	 ottomata: I suppose that's some fallout of the rules kicking into live operation, there's no dropped traffic on 1014 per se
[14:28:45] <ottomata>	 no?
[14:29:04] <ottomata>	 trying to see what's going on though, replica ISRs have shrunk
[14:29:15] <wikibugs>	 6operations, 5Continuous-Integration-Scaling, 7Nodepool: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#1793414 (10hashar) I tried but I eventually gave up. The toolchain is just too complicated for me to figure out.  So at first the /debian/  source...
[14:32:39] <ottomata>	 moritzm: this enabled base ferm on kafka1014, right?
[14:33:05] <ottomata>	 moritzm: i think we should roll back
[14:33:15] <moritzm>	 ottomata: ok, we can do that
[14:33:16] <grrrit-wm>	 (03PS2) 10Andrew Bogott: Nova scheduler changes: [puppet] - 10https://gerrit.wikimedia.org/r/251954 
[14:34:20] <ottomata>	 yeah, and consumers seem to have stopped working too from ka14
[14:34:31] <ottomata>	  Could not receive response to request ... Kafka @ kafka1014.eqiad.wmnet:9092 went away
[14:34:42] <ottomata>	     data = self._sock.recv(min(bytes_left, 4096))
[14:34:42] <ottomata>	 timeout: timed out
[14:35:24] <ottomata>	 it's strange though
[14:35:44] <ottomata>	 kafka1012 seems not able to replicate from kafka1014
[14:35:48] <ottomata>	 but other brokers seem to be doing so fine.
[14:36:10] <_joe_>	 !log reducing /tmp size on copper by shrinking the logical volume
[14:36:14] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:36:23] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Revert "Enable ferm on kafka1014" [puppet] - 10https://gerrit.wikimedia.org/r/251962 
[14:36:41] <ottomata>	 but, the consumers on eventlog1001 can't consume from kafka1014?
[14:36:41] <ottomata>	 hmmMmm
[14:36:56] <ottomata>	 really not sure what is going on here, am a little worried, there's some weird stuff going on with offset requests being out of range
[14:37:17] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Nova scheduler changes: [puppet] - 10https://gerrit.wikimedia.org/r/251954 (owner: 10Andrew Bogott)
[14:37:41] <moritzm>	 1012 is 10.64.5.12 and that is allowed in the rules for 9092
[14:38:26] <ottomata>	 moritzm: ja, but it has dropped out of in sync replica list for partitions where 1014 is the leader
[14:38:32] <moritzm>	 my guess is that during the rules setup 1014 was briefly unavailable and the others are acting on that (but we can also revert, see 251962)
[14:39:00] <ottomata>	 if that was the case it would recover quickly, no?
[14:39:08] <ottomata>	 also, eventlog1001 had a hiccup at the very least
[14:39:23] <ottomata>	 ok moritzm, hold for one sec before merging, going to restart eventlogging and see what happens in logs there, maybe it can consume...
[14:40:04] <ottomata>	 !log restarting eventlogging to see if it is ok after enabling firewall rules on kafka1014
[14:40:08] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:40:14] <moritzm>	 ok
[14:43:12] <ottomata>	 ok moritzm, let's revert
[14:43:19] <ottomata>	 i dunno what is happening, but it doesn't look normal
[14:43:54] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds
[14:44:44] <moritzm>	 ottomata: ok, going ahead
[14:45:11] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Revert "Enable ferm on kafka1014" [puppet] - 10https://gerrit.wikimedia.org/r/251962 
[14:45:35] <moritzm>	 ottomata: I stopped ferm manually and will merge next
[14:45:50] <ottomata>	 ok
[14:45:52] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Revert "Enable ferm on kafka1014" [puppet] - 10https://gerrit.wikimedia.org/r/251962 (owner: 10Muehlenhoff)
[14:46:47] <ottomata>	 moritzm: why would there be a hiccup at all?  does ferm enable the full firewall before allowing connections to the configured ports?
[14:46:48] <wikibugs>	 7Puppet, 5Patch-For-Review, 7Ruby: Fix easy problems reported by RuboCop in operations/puppet - https://phabricator.wikimedia.org/T112651#1793448 (10zeljkofilipin)
[14:47:09] <wikibugs>	 7Puppet, 5Patch-For-Review, 7Ruby: Fix easy problems reported by RuboCop in operations/puppet - https://phabricator.wikimedia.org/T112651#1793450 (10zeljkofilipin) 5Open>3stalled
[14:48:20] <godog>	 !log swift codfw-prod: ms-be2017 / ms-be2019 / ms-be2021 weight 1000
[14:48:23] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:48:48] <ottomata>	 moritzm: i wonder if we should try stopping kafka on kafka1014
[14:49:01] <ottomata>	 to allow consumers and producers and replicas to just CHILL
[14:49:06] <ottomata>	 then enable ferm.
[14:49:09] <ottomata>	 then start kafka back up.
[14:49:10] <wikibugs>	 6operations, 6Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#1793452 (10scfc)
[14:49:51] <moritzm>	 ottomata: that's caused by an interaction between ferm and puppet, since the rules which allow the granted traffic are added in steps, once the rules files in /etc/ferm/conf.d are fully generated, the rules are put into effect immediately
[14:50:30] <moritzm>	 ottomata: yeah, stopping kafka before making the flip would indeed probably make sense here
[14:52:22] <ottomata>	 ok, moritzm, the more I look, the more I think that doing that will be ok...  I do see some really strange offset requests, but afaict everything was working fine. I think the replicas I saw out of sync were not really (https://issues.apache.org/jira/browse/KAFKA-1367) (I forgot about this confusing issue).  
[14:52:24] <icinga-wm_>	 PROBLEM - puppet last run on mw2030 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:54:02] <moritzm>	 seems so (after all there were no failed connection attempts to 1014 at all), shall we make another try with a stopped kafka broker before enabling it?
[14:55:23] <ottomata>	 ja think so, i'm watching offsets for an eventlogging consumer now too, which will help me feel a little better when i see weird offset requests.
[14:55:31] <ottomata>	 moritzm: i'll take broker on 1014 down now...
[14:56:34] <ottomata>	 !log stopping kafka broker on kafka1014
[14:56:37] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:56:52] <ottomata>	 hmm, moritzm, puppet will restart kafka
[14:56:58] <ottomata>	 can you apply the ferm changes manually?
[14:57:11] <wikibugs>	 6operations, 10ops-eqiad, 10netops: test new sfp-t - https://phabricator.wikimedia.org/T118178#1793456 (10RobH) 3NEW a:3Cmjohnson
[14:58:46] <moritzm>	 ottomata: for 1014 I can simply restart it (since puppet created all the rules already)
[14:59:54] <ottomata>	 ok moritzm, kafka is stopped on 1014
[14:59:56] <ottomata>	 go ahead
[15:00:17] <moritzm>	 it's re-enabled
[15:00:44] <icinga-wm_>	 PROBLEM - puppet last run on mw1060 is CRITICAL: CRITICAL: Puppet has 1 failures
[15:00:44] <icinga-wm_>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [10.0]
[15:02:09] <ottomata>	 ok
[15:02:12] <icinga-wm_>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [10.0]
[15:02:18] <ottomata>	 !starting kafka broker on kafka1014
[15:02:28] <ottomata>	 !log starting kafka broker on kafka1014
[15:02:31] <icinga-wm_>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [10.0]
[15:02:35] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:02:36] <jynus>	 I cracked the code for recentchanges efficiency, but I need someone to dump my mind and reach to an implementation
[15:03:05] <jynus>	 maybe I should way for someone at perf
[15:03:08] <jynus>	 *Wait
[15:05:56] <wikibugs>	 6operations, 10Traffic: Planning for phasing out non-Forward-Secret TLS ciphers - https://phabricator.wikimedia.org/T118181#1793501 (10BBlack) 3NEW
[15:06:52] <ottomata>	 !log running kafka preferred-replica-election
[15:06:52] <icinga-wm_>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [10.0]
[15:06:55] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:08:22] <icinga-wm_>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [10.0]
[15:09:04] <wikibugs>	 6operations, 6Discovery: Fix CirrusSearch monitoring - https://phabricator.wikimedia.org/T84163#1793514 (10chasemp)
[15:09:32] <icinga-wm_>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [10.0]
[15:09:34] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Enable ferm on kafka1014 again [puppet] - 10https://gerrit.wikimedia.org/r/251964 
[15:09:35] <ottomata>	 ja, k moritzm, things are looking ok
[15:09:42] <icinga-wm_>	 RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0
[15:09:47] <ottomata>	 (ignore those under-replicated partition alerts, they will resolve shortly.)
[15:11:24] <moritzm>	 ottomata: ok, merging so that the status in puppet and on the system is in sync again
[15:11:33] <icinga-wm_>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [10.0]
[15:11:51] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on kafka1014 again [puppet] - 10https://gerrit.wikimedia.org/r/251964 (owner: 10Muehlenhoff)
[15:11:57] <wikibugs>	 6operations, 10ops-eqiad, 7Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1793526 (10fgiunchedi) I believe we went with 1G everywhere @cmjohnson? in any case looks like this is complete
[15:12:16] <wikibugs>	 6operations, 6Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#1793527 (10RobH) We'll need to modify our workflow.  Right now in RT, maint-annoucements come in multiple times for a single event.  We'll typically get an initial notification of maintenance, then...
[15:12:42] <icinga-wm_>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka1013 is OK: OK: Less than 1.00% above the threshold [1.0]
[15:13:08] <moritzm>	 ottomata: it's merged. ok to re-enable the puppet agent?
[15:13:29] <wikibugs>	 6operations, 7Swift: add ms-be1019 / 1020 / 1021 to swift - https://phabricator.wikimedia.org/T118183#1793530 (10fgiunchedi) 3NEW a:3fgiunchedi
[15:13:31] <icinga-wm_>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 1.00% above the threshold [1.0]
[15:13:47] <wikibugs>	 6operations, 6Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#1793539 (10RobH) We'll need to modify our workflow.  Right now in RT, maint-annoucements come in multiple times for a single event.  We'll typically get an initial notification of maintenance, then...
[15:13:52] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds
[15:14:11] <icinga-wm_>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka1018 is OK: OK: Less than 1.00% above the threshold [1.0]
[15:15:01] <icinga-wm_>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 1.00% above the threshold [1.0]
[15:15:04] <ottomata>	 moritzm:  yes
[15:15:07] <ottomata>	 do please
[15:15:32] <icinga-wm_>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka1012 is OK: OK: Less than 1.00% above the threshold [1.0]
[15:15:54] <wikibugs>	 6operations, 6Project-Creators: create #ops-eqdfw & #ops-eqord projects - https://phabricator.wikimedia.org/T117585#1793549 (10RobH) I was simply being paranoid.  I'll go ahead and create these later today and will update that page, thanks for linking it @krenair.
[15:17:52] <icinga-wm_>	 RECOVERY - puppet last run on mw2030 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[15:18:04] <wikibugs>	 7Puppet, 10Continuous-Integration-Config, 7Ruby: Move RuboCop job from experimental pipeline to the usual pipelines for operations/puppet - https://phabricator.wikimedia.org/T110019#1793567 (10zeljkofilipin)
[15:19:48] <paravoid>	 ottomata: darian was asking the other day too, I'm not sure what happened with it -- is there a replacement for http://stats.wikimedia.org/wikimedia/squids/SquidReportClients.htm ?
[15:21:18] <ottomata>	 paravoid: not that I know of, but you could get it out of https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly using user_agent_map
[15:21:29] <ottomata>	 and/or the new pageview API in AQS (which will be announced soon? maybe?)
[15:21:50] <ottomata>	 paravoid: there is a lot of discussion in analytics goals now about replacing many parts of stats.wm.o, but i don't know details
[15:22:02] <icinga-wm_>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection timed out
[15:23:01] <ottomata>	 moritzm: all looks fine
[15:23:06] <ottomata>	 shall we proceed with other brokers?
[15:23:19] <ottomata>	 oh, sorry, just saw your other message
[15:23:31] <paravoid>	 Coren/YuviPanda/andrewbogott/chasemp: ^ toolserver.org alert
[15:24:51] <chasemp>	 seems really down too, no clue where this lives but I'll get into it
[15:24:56] <grrrit-wm>	 (03PS2) 10Muehlenhoff: zookeeper: Don't expose the JMX port in ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/251951 
[15:25:30] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] zookeeper: Don't expose the JMX port in ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/251951 (owner: 10Muehlenhoff)
[15:26:11] <icinga-wm_>	 RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2016-06-30 17:56:02 +0000 (expires in 234 days)
[15:27:20] <_joe_>	 chasemp: that would be on the tools-proxies first of all
[15:27:25] <_joe_>	 as this was an ssl failure
[15:27:41] <icinga-wm_>	 RECOVERY - puppet last run on mw1060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:34:07] <grrrit-wm>	 (03PS1) 10Filippo Giunchedi: mediawiki: rename jobqueue.job-pop graphite alarm [puppet] - 10https://gerrit.wikimedia.org/r/251970 
[15:35:01] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] mediawiki: rename jobqueue.job-pop graphite alarm [puppet] - 10https://gerrit.wikimedia.org/r/251970 (owner: 10Filippo Giunchedi)
[15:42:47] <wikibugs>	 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, and 2 others: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1793660 (10chasemp)
[15:42:48] <wikibugs>	 6operations, 6Labs, 10Labs-Infrastructure, 10netops, and 3 others: Allocate subnet for labs test cluster instances - https://phabricator.wikimedia.org/T115492#1793659 (10chasemp) 5Open>3Resolved
[15:45:52] <godog>	 !log reimage graphite1002
[15:45:56] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:48:31] <wikibugs>	 6operations, 6Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#1793700 (10RobH) I've CC'd in @Dzahn.  Daniel regularly patrols maint-announce and updates the tracking calendar, so I want to ensure we have him review our potential plan(s).
[15:48:54] <icinga-wm_>	 PROBLEM - Host graphite1002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:52:15] <icinga-wm_>	 RECOVERY - Host graphite1002 is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms
[15:53:05] <jynus>	 graphite is in mirror mode, right?
[15:53:38] <godog>	 jynus: mirror to codfw, yes
[15:54:13] <grrrit-wm>	 (03CR) 10Subramanya Sastry: "This has been deployed already right?" [puppet] - 10https://gerrit.wikimedia.org/r/249399 (owner: 10Subramanya Sastry)
[15:54:14] <jynus>	 so graphite1001 and graphite1002 are sharded?
[15:55:07] <godog>	 no, graphite1002 is being used for tests ATM, hence the light-hearted reimage
[15:55:15] <jynus>	 ok
[15:55:35] <godog>	 jynus: actually the machine I've used to test linux hybrid ssd/disk caching
[15:56:24] <icinga-wm_>	 PROBLEM - configured eth on graphite1002 is CRITICAL: Connection refused by host
[15:56:44] <icinga-wm_>	 PROBLEM - dhclient process on graphite1002 is CRITICAL: Connection refused by host
[15:56:48] <ori>	 half-hearted good morning
[15:56:55] <icinga-wm_>	 PROBLEM - puppet last run on graphite1002 is CRITICAL: Connection refused by host
[15:57:14] <icinga-wm_>	 PROBLEM - salt-minion processes on graphite1002 is CRITICAL: Connection refused by host
[15:57:14] <icinga-wm_>	 PROBLEM - Disk space on graphite1002 is CRITICAL: Connection refused by host
[15:57:28] <icinga-wm_>	 PROBLEM - RAID on graphite1002 is CRITICAL: Connection refused by host
[15:57:29] <grrrit-wm>	 (03CR) 10Subramanya Sastry: "I ask because of https://www.mediawiki.org/wiki/Parsoid/Deployments#Monday.2C_Nov_9.2C_2015_around_1:15_pm_PT:_b869b084_to_be_deployed" [puppet] - 10https://gerrit.wikimedia.org/r/249399 (owner: 10Subramanya Sastry)
[15:57:45] <icinga-wm_>	 PROBLEM - DPKG on graphite1002 is CRITICAL: Connection refused by host
[15:58:56] <wikibugs>	 10Ops-Access-Requests, 6operations: Requesting access to add perf-roots group to graphite role - https://phabricator.wikimedia.org/T117256#1793735 (10RobH) +1 to @chasemp's proposed rename of the group (no objection to request.)
[16:00:04] <jouncebot>	 anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151109T1600).
[16:01:42] <matt_flaschen>	 Here for SWAT
[16:02:21] <icinga-wm_>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[16:02:27] <Krenair>	 hi
[16:03:59] <Krenair>	 Krinkle, James_F: ping?
[16:04:05] <Krenair>	 Is bgerstle not here?
[16:04:10] <Krenair>	 bgerstile*
[16:04:32] <wikibugs>	 6operations, 10hardware-requests: Detail codfw snapshot/dataset requirements - https://phabricator.wikimedia.org/T118173#1793760 (10ArielGlenn) It's not going to be straight up duplication.  There are two things at play here:   1) I want to get rid of nfs when we deploy in codfw.  If this seems like it's too h...
[16:04:47] <Krenair>	 apparently I got it right the first time, it's spelt wrong on the calendar
[16:05:23] <Krenair>	 pinged them in -mobile
[16:05:48] <Krenair>	 matt_flaschen, shall we do yours first then?
[16:05:49] <MatmaRex>	 James_F is in a meeting, be here soon
[16:05:56] <wikibugs>	 6operations, 10Dumps-Generation, 10hardware-requests: Detail codfw snapshot/dataset requirements - https://phabricator.wikimedia.org/T118173#1793765 (10ArielGlenn)
[16:06:18] <matt_flaschen>	 Krenair, sure.
[16:06:39] <James_F>	 Krenair: Pong.
[16:08:46] <wikibugs>	 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1793775 (10Ottomata) Hi all, I talked to @gwicke a little bit more about this last Thursday.  He impressed upon me a couple of good points I hadn't fully taken in before, and I wa...
[16:10:06] <Krenair>	 Just when I thought we had enough '*erbium's, apparently there's a #Mobile-App-Android-Sprint-70-Ytterbium project in phabricator
[16:10:36] <matt_flaschen>	 Krenair, I love their sprint names.
[16:10:41] <matt_flaschen>	 Mobile in general.
[16:10:44] <wikibugs>	 6operations, 10Dumps-Generation, 10hardware-requests: Detail codfw snapshot/dataset requirements - https://phabricator.wikimedia.org/T118173#1793778 (10ArielGlenn) So @robh can you have a look at the eqiad hw ticket and let's hash that out first?  Then we can use that as the basis for hw in codfw with whatev...
[16:10:56] <_joe_>	 !log removing old builds from copper
[16:11:00] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:11:35] <Krenair>	 Oh and a #Mobile-App-Android-Sprint-68-Erbium, I must've missed that one
[16:11:44] <Krenair>	 and #Mobile-App-Sprint-65-Android-Terbium
[16:11:55] <James_F>	 Krenair: They're here to get you.
[16:11:58] <Krenair>	 :)
[16:12:33] <Krenair>	 hm, at some point the 'Sprint $x' and 'Android' got swapped
[16:14:47] <logmsgbot>	 !log krenair@tin Synchronized php-1.27.0-wmf.5/extensions/Flow/modules/mw.flow.Initializer.js: https://gerrit.wikimedia.org/r/#/c/251560/ (duration: 00m 44s)
[16:14:49] <Krenair>	 matt_flaschen, please test ^
[16:14:51] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:16:30] <wikibugs>	 10Ops-Access-Requests, 6operations: Requesting access to add perf-roots group to graphite role - https://phabricator.wikimedia.org/T117256#1793801 (10chasemp) >>! In T117256#1793735, @RobH wrote: > +1 to @chasemp's proposed rename of the group (no objection to request.)  I did rename this already fyi
[16:17:45] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: swift: monitor mediawiki originals upload rate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/251526 (https://phabricator.wikimedia.org/T92322) (owner: 10Filippo Giunchedi)
[16:17:57] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: swift: monitor mediawiki originals upload rate [puppet] - 10https://gerrit.wikimedia.org/r/251526 (https://phabricator.wikimedia.org/T92322) 
[16:18:24] <Krenair>	 matt_flaschen, ?
[16:18:33] <wikibugs>	 6operations, 10Dumps-Generation, 10hardware-requests: determine hardware needs for dumps in eqiad (boxes out of warranty, capacity planning) - https://phabricator.wikimedia.org/T118154#1793805 (10RobH)
[16:18:57] <matt_flaschen>	 Krenair, I tested, it works but there is a problem.
[16:19:08] <wikibugs>	 6operations, 10Dumps-Generation, 10hardware-requests: determine hardware needs for dumps in eqiad (boxes out of warranty, capacity planning) - https://phabricator.wikimedia.org/T118154#1793806 (10RobH) I've claimed and added this to #hardware-requests.  With the details provided by @ArielGlenn, I'll request...
[16:19:33] <matt_flaschen>	 It makes the edit successfully but doesn't render it, and there is a JS error.  No need to revert, I don't think.  I will fix the remaining issue now.
[16:19:37] <Krenair>	 okay
[16:19:55] <Krenair>	 [config+script] 251677 Enable Flow user opt-in Beta Feature on Wikidata task T116611
[16:20:31] <Krenair>	 matt_flaschen, James_F: Shall we proceed with this? What needs to be run exactly?
[16:21:09] <matt_flaschen>	 FlowUpdateBetaFeaturePreference.php needs to be run after.  I can do that.
[16:21:28] <Krenair>	 Just "mwscript extensions/Flow/maintenance/FlowUpdateBetaFeaturePreference.php wikidatawiki"?
[16:21:32] <Krenair>	 ok, I'll leave that to you
[16:21:55] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 032] Enable Flow user opt-in Beta Feature on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251677 (https://phabricator.wikimedia.org/T116611) (owner: 10Jforrester)
[16:22:55] <grrrit-wm>	 (03Merged) 10jenkins-bot: Enable Flow user opt-in Beta Feature on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251677 (https://phabricator.wikimedia.org/T116611) (owner: 10Jforrester)
[16:24:02] <logmsgbot>	 !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/251677/ (duration: 00m 35s)
[16:24:04] <Krenair>	 matt_flaschen, ^ please test
[16:24:06] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:25:47] <matt_flaschen>	 Krenair, works on my page (https://www.wikidata.org/wiki/User_talk:Mattflaschen-WMF).  I'll run the script now.
[16:26:05] <wikibugs>	 10Ops-Access-Requests, 6operations: Requesting access to add perf-roots group to graphite role - https://phabricator.wikimedia.org/T117256#1793813 (10chasemp)
[16:26:57] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 032] Add an apple-app-site-association file used to support iOS deep-linking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250897 (https://phabricator.wikimedia.org/T111829) (owner: 10Bgerstle)
[16:27:02] <Krenair>	 bgerstle, bd808: hi
[16:27:10] <bgerstle>	 Krenair: o/
[16:27:29] <Krenair>	 this should work for all wikipedia subdomains, right
[16:27:30] <Krenair>	 ?
[16:27:37] <bd808>	 Krenair: yes
[16:27:38] <bgerstle>	 Krenair: correct
[16:27:39] <grrrit-wm>	 (03Merged) 10jenkins-bot: Add an apple-app-site-association file used to support iOS deep-linking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250897 (https://phabricator.wikimedia.org/T111829) (owner: 10Bgerstle)
[16:27:44] <Krenair>	 Even the non-language-based ones?
[16:27:46] <bgerstle>	 including "m.$lang.wikipedia..."
[16:27:52] <James_F>	 Krenair: Flow/Wikidata LGTM.
[16:28:06] <bgerstle>	 Krenair: not concerned w/ meta/office if that's what you mean
[16:28:15] <Krenair>	 those aren't on wikipedia.org
[16:28:21] <bgerstle>	 yeah sorry, 
[16:28:23] <bd808>	 Krenair: the magic of when it gets used will be embedded in the iOS app itself
[16:28:42] <bgerstle>	 that file in particular, yes
[16:28:58] <Krenair>	 Can the app handle nostalgia, for example?
[16:29:00] <bgerstle>	 apple will download it to the device when the app is installed, if the app has certain "entitlements" in its package
[16:29:14] <bgerstle>	 Krenair: oh right. or simple, i guess?
[16:29:17] <Krenair>	 the arbcom-* subdomains?
[16:29:24] <Krenair>	 yes, simple?
[16:29:46] <bgerstle>	 we're focusing on lang-specific ones, i guess. not much thought has been given to handle those
[16:29:49] <Krenair>	 wg-en? test subdomains?
[16:29:58] <Krenair>	 ten?
[16:30:02] <matt_flaschen>	 !log Ran mwscript extensions/Flow/maintenance/FlowUpdateBetaFeaturePreference.php --wiki=wikidatawiki
[16:30:03] <bgerstle>	 it should def work on test & betalabs. that's the key so we can start prototyping stuff
[16:30:06] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:30:25] <Krenair>	 Can it work on those? If not, does it detect this and send the user to their browser?
[16:31:26] <bd808>	 bgerstle: the app will have to register domains to intercept, correct? I think Krenair is mostly worried that things will happen greedily and break things that work today.
[16:31:29] <bgerstle>	 Krenair: it will always open in safari initially
[16:31:38] <bgerstle>	 _if_ there's a site assoc. file for that domain
[16:31:41] <bgerstle>	 _and_ there's markup on the page
[16:31:57] <Krenair>	 bd808, will it register subdomains or *.wikipedia.org?
[16:32:00] <bgerstle>	 _then_ the OS will prompt the user
[16:32:15] <bgerstle>	 Krenair: bd808 only the specific domains we register in the app
[16:32:37] <bgerstle>	 i was planning on writing a script which generates the entitlements (file where this is declared) based on Special:SiteMatrix
[16:32:54] <bgerstle>	 only the wikipedia column, for now at least
[16:32:59] <Krenair>	 I was about to say, if you're registering against subdomains you really need to have a section on the new wiki creation process
[16:33:05] <Krenair>	 But if it's sitematrix based that should be okay
[16:33:39] <bgerstle>	 Krenair: bd808 the main goal for now is to get the file up on beta labs so we can start working on the next steps, i.e. web markup and user flows
[16:34:01] <bgerstle>	 so it's not a blocker if this doesn't work on prod domains (as it shouldn't be used there anyway)
[16:34:02] <wikibugs>	 6operations, 10OTRS: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1793847 (10Steinsplitter) >>! In T74109#1757895, @akosiaris wrote: > I 've upgraded the test installation today to OTRS version 5.0.1. There is one thing that has not been upgraded to version 5 and that...
[16:34:08] <Krenair>	 ... You're aware that this is putting it in production, right?
[16:34:21] <bgerstle>	 it shouldn't have an effect
[16:34:29] <Krenair>	 ok
[16:35:00] <logmsgbot>	 !log krenair@tin Synchronized docroot/wikipedia.org/apple-app-site-association: https://gerrit.wikimedia.org/r/#/c/250897/ (duration: 00m 34s)
[16:35:04] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:35:15] <bgerstle>	 i can follow-up when we plan to "roll this out" to other domains so you can watch for potential impact of clients downloading this file
[16:35:16] <Krenair>	 bgerstle: https://en.m.wikipedia.org/apple-app-site-association
[16:35:37] <bgerstle>	 Krenair: so, is this on betalabs now too?
[16:35:40] <bd808>	 Also http://en.wikipedia.beta.wmflabs.org/apple-app-site-association
[16:36:12] <bd808>	 and http://en.m.wikipedia.beta.wmflabs.org/apple-app-site-association
[16:36:15] <bd808>	 lgtm
[16:36:16] <bgerstle>	 👍🏻
[16:38:15] <icinga-wm_>	 PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:38:41] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 032] Enable VisualEditor for draft namespace in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251674 (https://phabricator.wikimedia.org/T118060) (owner: 10Ladsgroup)
[16:39:22] <grrrit-wm>	 (03Merged) 10jenkins-bot: Enable VisualEditor for draft namespace in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251674 (https://phabricator.wikimedia.org/T118060) (owner: 10Ladsgroup)
[16:39:48] <wikibugs>	 6operations, 10OTRS: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1793858 (10akosiaris) >>! In T74109#1793847, @Steinsplitter wrote: >>>! In T74109#1757895, @akosiaris wrote: >> I 've upgraded the test installation today to OTRS version 5.0.1. There is one thing that...
[16:40:04] <icinga-wm_>	 RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING
[16:40:25] <logmsgbot>	 !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/251674/ (duration: 00m 35s)
[16:40:29] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:40:32] <Krenair>	 is one server slow?
[16:40:49] <Krenair>	 James_F, ^
[16:40:56] <James_F>	 Ta.
[16:41:25] <Reedy>	 Isn't it the whole sync of git stuff to codfw that's slow
[16:41:26] <Reedy>	 ?
[16:42:13] <Krenair>	 yes
[16:42:14] <James_F>	 Krenair: Yup, working.
[16:42:58] <wikibugs>	 6operations, 10OTRS: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1793861 (10akosiaris) >>! In T74109#1793858, @akosiaris wrote: >>>! In T74109#1793847, @Steinsplitter wrote: >>>>! In T74109#1757895, @akosiaris wrote: >>> I 've upgraded the test installation today to...
[16:43:33] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: Add cr2-esams to monitoring tools [puppet] - 10https://gerrit.wikimedia.org/r/251983 
[16:43:35] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: network: monitor mr1 OOB links too [puppet] - 10https://gerrit.wikimedia.org/r/251984 
[16:43:37] <Steinsplitter>	 akosiaris: you're fast as well :) cool. thanks a lot
[16:44:25] <logmsgbot>	 !log krenair@tin Synchronized php-1.27.0-wmf.5/extensions/VisualEditor/modules/ve-mw/init: https://gerrit.wikimedia.org/r/#/c/251972/ (duration: 00m 34s)
[16:44:29] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:44:29] <Krenair>	 James_F, ^
[16:47:07] <James_F>	 Krenair: Looks to be working fine.
[16:47:34] <grrrit-wm>	 (03PS3) 10Alex Monk: Add patroller group to sawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250454 (https://phabricator.wikimedia.org/T117314) (owner: 10Luke081515)
[16:47:36] <wikibugs>	 6operations, 10Gitblit: Accessing raw link on git.wikimedia.org causes "Error Sorry, the repository mediawiki does not have a extensions branch!" - https://phabricator.wikimedia.org/T118156#1793885 (10Paladox) Please read https://github.com/gitblit/gitblit/issues/949#issuecomment-155110940  According to the au...
[16:47:40] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 031] network: monitor mr1 OOB links too [puppet] - 10https://gerrit.wikimedia.org/r/251984 (owner: 10Faidon Liambotis)
[16:48:15] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 031] Add cr2-esams to monitoring tools [puppet] - 10https://gerrit.wikimedia.org/r/251983 (owner: 10Faidon Liambotis)
[16:48:17] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 032] Add patroller group to sawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250454 (https://phabricator.wikimedia.org/T117314) (owner: 10Luke081515)
[16:48:28] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] Add cr2-esams to monitoring tools [puppet] - 10https://gerrit.wikimedia.org/r/251983 (owner: 10Faidon Liambotis)
[16:48:36] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] network: monitor mr1 OOB links too [puppet] - 10https://gerrit.wikimedia.org/r/251984 (owner: 10Faidon Liambotis)
[16:48:44] <grrrit-wm>	 (03Merged) 10jenkins-bot: Add patroller group to sawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250454 (https://phabricator.wikimedia.org/T117314) (owner: 10Luke081515)
[16:49:03] <James_F>	 Krenair: … hmm.
[16:49:16] <Krenair>	 James_F, ?
[16:49:51] <James_F>	 Krenair: No, never mind, think I pressed the wrong button.
[16:49:58] <Krenair>	 :)
[16:50:51] <logmsgbot>	 !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/250454/ (duration: 00m 34s)
[16:50:55] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:52:54] <Krenair>	 Huh: https://wg-en.m.wikipedia.org/apple-app-site-association
[16:53:05] <Krenair>	 apparently that domain isn't in DNS?
[16:53:14] <Krenair>	 But the non-m version is. Helpful.
[16:53:36] <Krenair>	 Oh well, it's an old locked wiki.
[16:53:40] <Krenair>	 I wonder if there are any more though.
[16:54:12] <YuviPanda>	 _joe_: wasn't the tools-proxies, since toolserver.org is different (just hosts redirects) in a different module
[16:54:39] <chasemp>	 YuviPanda: we tracked it down thanks, a few questions on possibly collapsing things but for later :)
[16:54:47] <Krenair>	 Krinkle, you around?
[16:54:47] <Dereckson>	  /win go 36
[16:54:58] <YuviPanda>	 chasemp: kk
[16:55:04] <chasemp>	 YuviPanda: it seems to have just been overzealous requests tanking a one core vm that is still doing a bit of user redirect traffic
[16:55:16] <YuviPanda>	 yeah
[16:55:21] <YuviPanda>	 it's also apache for no reason instead of just nginx
[16:57:01] <grrrit-wm>	 (03CR) 10Sbgujarat: "I have seen this and this is good." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250454 (https://phabricator.wikimedia.org/T117314) (owner: 10Luke081515)
[17:03:16] <grrrit-wm>	 (03CR) 10NehalDaveND: "This is the best." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250454 (https://phabricator.wikimedia.org/T117314) (owner: 10Luke081515)
[17:07:14] <wikibugs>	 6operations, 6Discovery, 5codfw-rollout: [EPIC] Set up a CirrusSearch cluster in codfw (Dallas, Texas) - https://phabricator.wikimedia.org/T105703#1793970 (10chasemp) ping'd on irc but, is this epic task now done?  anything remaining?
[17:23:32] <grrrit-wm>	 (03PS1) 10Luke081515: Set throttle exception for account creation on hewiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252010 (https://phabricator.wikimedia.org/T118122) 
[17:26:20] <grrrit-wm>	 (03CR) 10RobH: "this patchset allows them to delete keys as well, did we want this to allow reinstall as well as initial install? (Less oversight.)" [puppet] - 10https://gerrit.wikimedia.org/r/249483 (https://phabricator.wikimedia.org/T116884) (owner: 10Dzahn)
[17:26:25] <icinga-wm_>	 PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: puppet fail
[17:34:55] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 39, down: 14, dormant: 0, excluded: 0, unused: 0; et-0/2/1: down - xe-0/1/3: down - xe-0/1/0: down - xe-0/1/7: down - xe-0/1/8: down - xe-0/1/9: down - xe-0/1/6: down - xe-0/1/11: down - xe-0/1/4: down - xe-0/1/10: down - xe-0/1/1: down - xe-0/1/2: down - xe-0/1/5: down - et-0/2/2: down
[17:35:25] <grrrit-wm>	 (03CR) 10Alex Monk: Allow import from any Labs/Beta Cluster project to any other (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO)
[17:36:15] <icinga-wm_>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0]
[17:38:15] <icinga-wm_>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:38:31] <grrrit-wm>	 (03PS10) 10Alex Monk: Allow import from any Labs/Beta Cluster project to any other [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO)
[17:38:35] <grrrit-wm>	 (03PS1) 10Luke081515: Add new group to enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252012 (https://phabricator.wikimedia.org/T113109) 
[17:40:34] <grrrit-wm>	 (03CR) 10Luke081515: "Please look at the community consensus again, to make sure, that enough users agreed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252012 (https://phabricator.wikimedia.org/T113109) (owner: 10Luke081515)
[17:42:01] <grrrit-wm>	 (03CR) 10CSteipp: "> How does this work with CentralAuth...? Can't the attacker just login on another wiki and use autologin to get access to enwiki?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251678 (owner: 10CSteipp)
[17:45:55] <wikibugs>	 6operations, 6Discovery, 5codfw-rollout: [EPIC] Set up a CirrusSearch cluster in codfw (Dallas, Texas) - https://phabricator.wikimedia.org/T105703#1794100 (10Deskana) 5Open>3Resolved a:3Deskana >>! In T105703#1793970, @chasemp wrote: > ping'd on irc but, is this epic task now done?  anything remaining?...
[17:48:29] <wikibugs>	 6operations, 10ops-codfw: es2010 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T117848#1794111 (10jcrespo) @mark^
[17:50:58] <wikibugs>	 6operations, 10Beta-Cluster-Infrastructure, 7Blocked-on-RelEng, 7HHVM, 5Patch-For-Review: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1794121 (10mmodell) Is this really blocked on #blocked-on-releng?
[17:51:34] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 031] Allow import from any Labs/Beta Cluster project to any other [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO)
[17:53:26] <icinga-wm_>	 RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[17:55:08] <bblack>	 !log upgrading pybal (1.10 -> 1.12) on lvs200[456].codfw.wmnet
[17:55:11] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:57:24] <grrrit-wm>	 (03PS4) 10Faidon Liambotis: Switch Central/South Asia to esams [dns] - 10https://gerrit.wikimedia.org/r/239072 
[17:58:31] <grrrit-wm>	 (03PS5) 10Alexandros Kosiaris: rubocop: Ignoring Style/WordArray offense [puppet] - 10https://gerrit.wikimedia.org/r/238778 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin)
[17:58:45] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 031] rubocop: Ignoring Style/WordArray offense [puppet] - 10https://gerrit.wikimedia.org/r/238778 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin)
[18:00:03] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] rubocop: Ignoring Style/WordArray offense [puppet] - 10https://gerrit.wikimedia.org/r/238778 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin)
[18:03:14] <godog>	 mobrovac: h-o ?
[18:03:21] <godog>	 nevermind
[18:04:08] <icinga-wm_>	 PROBLEM - Host labtestservices2001 is DOWN: PING CRITICAL - Packet loss = 100%
[18:18:06] <grrrit-wm>	 (03PS11) 10Krinkle: Enable import from any Beta Cluster project to another [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO)
[18:19:47] <grrrit-wm>	 (03CR) 10Luke081515: [C: 031] Enable import from any Beta Cluster project to another [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO)
[18:23:59] <grrrit-wm>	 (03CR) 10Krinkle: Enable import from any Beta Cluster project to another (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO)
[18:41:12] <grrrit-wm>	 (03CR) 10Dzahn: servermon: add ferm rules for http/https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/251552 (https://phabricator.wikimedia.org/T105410) (owner: 10Dzahn)
[18:41:14] <grrrit-wm>	 (03PS2) 10Dzahn: servermon: add ferm rules for http/https [puppet] - 10https://gerrit.wikimedia.org/r/251552 (https://phabricator.wikimedia.org/T105410) 
[18:41:59] <grrrit-wm>	 (03PS3) 10Dzahn: librenms: add ferm rules for http/https [puppet] - 10https://gerrit.wikimedia.org/r/251550 (https://phabricator.wikimedia.org/T105410) 
[18:42:13] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] librenms: add ferm rules for http/https [puppet] - 10https://gerrit.wikimedia.org/r/251550 (https://phabricator.wikimedia.org/T105410) (owner: 10Dzahn)
[18:42:43] <jynus>	 !log restarting mysql in db1060 to test new performance configuration
[18:42:47] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:43:23] <grrrit-wm>	 (03CR) 10Dzahn: "as opposed to servermon, this is also on 443." [puppet] - 10https://gerrit.wikimedia.org/r/251550 (https://phabricator.wikimedia.org/T105410) (owner: 10Dzahn)
[18:44:04] <grrrit-wm>	 (03Abandoned) 10Dzahn: dns::recursor: move 'standard' and v6 IP to role [puppet] - 10https://gerrit.wikimedia.org/r/250616 (owner: 10Dzahn)
[18:44:11] <MatmaRex>	 in context of the train deployments, Commons is in group0 (All non-Wikipedia sites), not group1 (All Wikipedias), correct?
[18:44:30] <MatmaRex>	 (i know that some tools consider it a Wikipedia with language "commons", so asking to be sure)
[18:49:05] <Krenair>	 I don't think group0 is all non-wikipedia sites, is it?
[18:49:55] <Krenair>	 group 0: MediaWiki.org; test.wikipedia.org; test2.wikipedia.org; test.wikidata.org; zero.wikimedia.org
[18:50:04] <Krenair>	 group 1: All non-Wikipedia sites
[18:50:11] <Krenair>	 group 2: All Wikipedias
[18:50:54] <MatmaRex>	 sorry, s/1/2/, s/0/1/
[18:51:27] <MatmaRex>	 so, what is the answer to the corrected question? :)
[18:51:34] <Krenair>	 Some tools consider commons to be a wikipedia because it ends with that 'wiki' prefix
[18:51:45] <greg-g>	 commons is in group1
[18:51:49] <greg-g>	 ie: wednesday
[18:52:14] <greg-g>	 https://wikitech.wikimedia.org/wiki/Deployments/One_week
[18:52:18] <greg-g>	 answers such questions :)
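The group definitions quoted above can be sketched as a small lookup. This is only an illustration: the group 0 entries are the ones Krenair listed, while the function name `train_group` and the idea of classifying by domain name are assumptions made for this sketch, not how the deployment tooling actually decides.

```python
# Group 0 sites as listed in the channel (lowercased). Everything else is
# classified purely by domain name, which is an assumption of this sketch.
GROUP0 = {
    "mediawiki.org",
    "test.wikipedia.org",
    "test2.wikipedia.org",
    "test.wikidata.org",
    "zero.wikimedia.org",
}


def train_group(domain):
    """Return 0, 1 or 2 for a site domain (hypothetical helper)."""
    domain = domain.lower()
    if domain.startswith("www."):
        domain = domain[len("www."):]
    if domain in GROUP0:
        return 0  # group 0: the explicit test/early-adopter list
    if domain.endswith(".wikipedia.org"):
        return 2  # group 2: all Wikipedias
    return 1      # group 1: all non-Wikipedia sites


if __name__ == "__main__":
    for site in ("commons.wikimedia.org", "en.wikipedia.org", "test.wikipedia.org"):
        print(site, "-> group", train_group(site))
```

Under these assumptions Commons lands in group 1, which agrees with greg-g's answer.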
[18:52:22] <Krenair>	 sorry, suffix. not prefix.
[18:52:26] <Krenair>	 historical reasons probably
[18:52:35] <Krenair>	 at some point there was just wikipedia
[18:52:42] <Krenair>	 and now we can't rename databases
[18:52:48] <Krenair>	 so we're stuck with it
[18:53:31] <Krenair>	 honestly one of the things I hate most about our particular configuration of mediawiki
[18:56:06] <jynus>	 and I just made a query 300x faster
[18:56:21] <apergos>	 commonswiki.  wikidatawiki.  clearly wikipedias 
[18:56:40] <apergos>	 (right. more food, less snark)
[18:57:42] <Krenair>	 my favourite example is sourceswiki
[18:58:15] <Krenair>	 guess the domain, and guess what site a bunch of our software thinks it is
[18:58:33] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Assign salt grains for cp* hosts [puppet] - 10https://gerrit.wikimedia.org/r/251949 
[18:58:33] <greg-g>	 jynus: awesome way to start the week!
[18:59:15] <Krenair>	 well, 'favourite'
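A minimal sketch of the misclassification described above, assuming a tool that infers the project only from the database name suffix. The function names, the `SPECIAL_CASES` set and the `"not-a-wikipedia"` label are hypothetical; the database names are the ones mentioned in the channel.

```python
# Database names from the discussion that end in "wiki" but are not Wikipedias.
SPECIAL_CASES = {"commonswiki", "wikidatawiki", "sourceswiki"}


def naive_parse(dbname):
    """Suffix-only guess: anything ending in 'wiki' is treated as a Wikipedia."""
    if dbname.endswith("wiki"):
        return dbname[:-len("wiki")], "wikipedia"
    return None


def careful_parse(dbname):
    """Same guess, but checks an explicit exception list first (assumed list)."""
    if dbname in SPECIAL_CASES:
        return dbname[:-len("wiki")], "not-a-wikipedia"
    return naive_parse(dbname)


if __name__ == "__main__":
    for name in ("enwiki", "commonswiki", "wikidatawiki", "sourceswiki"):
        print(name, naive_parse(name), careful_parse(name))
```

With the naive version, `commonswiki` comes out as a Wikipedia whose language is "commons", which is the behaviour MatmaRex mentioned.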
[18:59:57] <icinga-wm_>	 PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: puppet fail
[19:01:20] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for cp* hosts [puppet] - 10https://gerrit.wikimedia.org/r/251949 (owner: 10Muehlenhoff)
[19:01:45] <grrrit-wm>	 (03PS1) 10BryanDavis: Monolog: Disable microsecond timestamps on all loggers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252017 (https://phabricator.wikimedia.org/T116550) 
[19:06:53] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Use role spare for calcium [puppet] - 10https://gerrit.wikimedia.org/r/252018 
[19:07:49] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] Use role spare for calcium [puppet] - 10https://gerrit.wikimedia.org/r/252018 (owner: 10Muehlenhoff)
[19:09:07] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds
[19:12:02] <wikibugs>	 6operations: reclaim calcium to spares - https://phabricator.wikimedia.org/T116790#1794376 (10Dzahn) also see T105553 and T83044
[19:12:29] <grrrit-wm>	 (03PS2) 10BryanDavis: Monolog: Disable microsecond timestamps on all loggers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252017 (https://phabricator.wikimedia.org/T116550) 
[19:17:45] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Assign salt grains for analytics::mysql::meta role [puppet] - 10https://gerrit.wikimedia.org/r/252020 
[19:18:55] <grrrit-wm>	 (03PS2) 10RobH: admin: hoo and jzerebecki for wdqs admins [puppet] - 10https://gerrit.wikimedia.org/r/249027 (https://phabricator.wikimedia.org/T116702) (owner: 10Dzahn)
[19:19:49] <grrrit-wm>	 (03CR) 10RobH: [C: 032] admin: hoo and jzerebecki for wdqs admins [puppet] - 10https://gerrit.wikimedia.org/r/249027 (https://phabricator.wikimedia.org/T116702) (owner: 10Dzahn)
[19:21:18] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: wdqs-admin group membership for Marius Hoch (hoo) and Jan Zerebecki - https://phabricator.wikimedia.org/T116702#1794389 (10RobH) 5Open>3Resolved a:3RobH This was approved in the operations meeting, and has been merged live.  Once the affected hosts...
[19:21:37] <grrrit-wm>	 (03CR) 10Dzahn: "the sshd config part comes from this in hiera:" [puppet] - 10https://gerrit.wikimedia.org/r/250659 (owner: 10Muehlenhoff)
[19:22:06] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Assign salt grains for analytics::mysql::meta role [puppet] - 10https://gerrit.wikimedia.org/r/252020 
[19:22:13] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for analytics::mysql::meta role [puppet] - 10https://gerrit.wikimedia.org/r/252020 (owner: 10Muehlenhoff)
[19:22:33] <grrrit-wm>	 (03CR) 10Dzahn: ".. and only now that would be actually applied because the role keyword was not first before. but maybe we can meanwhile remove these exce" [puppet] - 10https://gerrit.wikimedia.org/r/250659 (owner: 10Muehlenhoff)
[19:23:31] <mutante>	 andrewbogott: is this (still) true?  "paramiko needs to to ssh into silver to support designate"
[19:23:50] <YuviPanda>	 yes
[19:23:57] <YuviPanda>	 (afaict)
[19:24:19] <YuviPanda>	 oh lol, maybe it didn't work so far?
[19:24:22] <YuviPanda>	 hmm
[19:24:25] <mutante>	 YuviPanda: ok, thanks. the comment said "ssh into these" but it is in role/common/nova/controller.yaml
[19:24:32] <mutante>	 yes, it didn't work so far
[19:24:52] <mutante>	 and now we can either make it work or remove the exception for it as opposed to regular sshd
[19:24:56] <YuviPanda>	 right
[19:25:01] <YuviPanda>	 I guess andrewbogott will know more
[19:25:05] <mutante>	 *nod* ok
[19:25:08] <icinga-wm_>	 RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[19:27:09] <grrrit-wm>	 (03PS1) 10Bmansurov: Enable RelatedArticles on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252022 
[19:30:48] <grrrit-wm>	 (03PS2) 10Bmansurov: Enable RelatedArticles on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252022 (https://phabricator.wikimedia.org/T116707) 
[19:34:40] <wikibugs>	 10Ops-Access-Requests, 6operations: Requesting access to rest base and cassandra nodes - https://phabricator.wikimedia.org/T117473#1794429 (10RobH) a:3Nuria @Nuria: Do you know what group grants the permissions required?  I don't see any aqs named groups, other than the aqs-admins, which is allowing restarti...
[19:35:25] <grrrit-wm>	 (03PS3) 10Dzahn: admin: let dc-ops sign puppet certs, add salt keys [puppet] - 10https://gerrit.wikimedia.org/r/249483 (https://phabricator.wikimedia.org/T116884) 
[19:36:26] <grrrit-wm>	 (03PS4) 10Dzahn: admin: let dc-ops sign puppet certs, add salt keys [puppet] - 10https://gerrit.wikimedia.org/r/249483 (https://phabricator.wikimedia.org/T116884) 
[19:36:42] <grrrit-wm>	 (03PS1) 10Jcrespo: Pool db1060 after performance patch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252023 
[19:38:09] <grrrit-wm>	 (03PS2) 10Jcrespo: Pool db1060 after performance patch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252023 
[19:38:11] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] "approved in ops meeting" [puppet] - 10https://gerrit.wikimedia.org/r/249483 (https://phabricator.wikimedia.org/T116884) (owner: 10Dzahn)
[19:39:11] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Pool db1060 after performance patch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252023 (owner: 10Jcrespo)
[19:40:50] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool db1060 after performance patch (duration: 00m 35s)
[19:40:54] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:41:01] <wikibugs>	 6operations, 7Monitoring: Migrate monitoring alerts from watchmouse to catchpoint - https://phabricator.wikimedia.org/T107092#1794447 (10RobH) 5stalled>3declined We have alerting via email for catchpoint, and these monitor different things (plus nimsoft wasn't reliable.)  As such, I'm going to decline this...
[19:41:18] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: let datacenter-ops sign puppet certs and accept salt keys - https://phabricator.wikimedia.org/T116884#1794449 (10Dzahn) was approved in ops meeting. merged.  on palladium:  +%datacenter-ops ALL = NOPASSWD: /usr/bin/salt-key * +%datacenter-ops ALL = NOPAS...
[19:41:25] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: let datacenter-ops sign puppet certs and accept salt keys - https://phabricator.wikimedia.org/T116884#1794452 (10Dzahn) a:3Dzahn
[19:41:37] <wikibugs>	 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1794455 (10RobH) 5Open>3Resolved I resolved the last RT procurement tickets today, and now we use phabricator for all procurement.
[19:43:15] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: let datacenter-ops sign puppet certs and accept salt keys - https://phabricator.wikimedia.org/T116884#1760888 (10Dzahn) @papaul  see above, when you are back from your vacation and get to install a server again, you can now do the puppet and salt part. l...
[19:43:21] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: let datacenter-ops sign puppet certs and accept salt keys - https://phabricator.wikimedia.org/T116884#1794466 (10Dzahn) 5Open>3Resolved
[19:44:26] <icinga-wm_>	 RECOVERY - Host labtestservices2001 is UP: PING OK - Packet loss = 0%, RTA = 35.62 ms
[19:46:35] <wikibugs>	 6operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#1794476 (10ori)
[19:46:50] <Krenair>	 robh, hey, so what's left in RT?
[19:46:56] <wikibugs>	 6operations, 10ops-ulsfo: populate spares data for ulsfo - https://phabricator.wikimedia.org/T118207#1794478 (10RobH) 3NEW a:3RobH
[19:48:08] <icinga-wm_>	 PROBLEM - puppet last run on labtestservices2001 is CRITICAL: Connection refused by host
[19:48:08] <icinga-wm_>	 PROBLEM - Disk space on labtestservices2001 is CRITICAL: Connection refused by host
[19:48:26] <icinga-wm_>	 PROBLEM - salt-minion processes on labtestservices2001 is CRITICAL: Connection refused by host
[19:48:48] <icinga-wm_>	 PROBLEM - RAID on labtestservices2001 is CRITICAL: Connection refused by host
[19:49:17] <icinga-wm_>	 PROBLEM - configured eth on labtestservices2001 is CRITICAL: Connection refused by host
[19:49:37] <icinga-wm_>	 PROBLEM - dhclient process on labtestservices2001 is CRITICAL: Connection refused by host
[19:49:37] <icinga-wm_>	 PROBLEM - DPKG on labtestservices2001 is CRITICAL: Connection refused by host
[19:49:41] <robh>	 Krenair: maint-announce
[19:49:49] <robh>	 https://phabricator.wikimedia.org/T118176
[19:50:00] <Krenair>	 is that for the datacenters to send announcements to?
[19:50:07] <robh>	 Once that is done we'll kill the RT mail relays and shove it into a ganeti VM
[19:50:14] <robh>	 and carriers
[19:50:23] <robh>	 datacenter and carrier vendors.
[19:50:37] <Krenair>	 you're going to move RT to a VM?
[19:50:45] <Krenair>	 what will it actually have other than historical records?
[19:50:48] <chasemp>	 labtestservices2001 is me I guess, it shouldn't be alerting
[19:50:49] <robh>	 we cannot kill it entirely as not every ticket was migrated
[19:50:57] <robh>	 and we may need it to reference past purchase approvals
[19:51:03] <robh>	 (as we never migrated the old procurement queue)
[19:51:16] <Krenair>	 so you're going to keep it alive indefinitely? :/
[19:51:25] <robh>	 not my idea.
[19:51:26] <wikibugs>	 6operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#1794505 (10Peter) FYI: I've been filing bugs for SPDY/Chrome and WebPageTest and Pat Meenan (of the Chrome team) reminded me that Chrome will drop support for SPDY early 2016. He also said the team will reach...
[19:51:29] <Krenair>	 heh, ok
[19:52:18] <robh>	 so ideally someone imports the data and marks it resolved or declined (for rejected tickets)
[19:52:26] <robh>	 or we'll strip it to html like bz
[19:52:44] <robh>	 it's unclear if one of those will happen, or if it's a vm with full RT stack minus mail
[19:55:03] <grrrit-wm>	 (03PS6) 10Alexandros Kosiaris: rubocop: Ignoring Style/WordArray offense [puppet] - 10https://gerrit.wikimedia.org/r/238778 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin)
[19:55:10] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [V: 032] rubocop: Ignoring Style/WordArray offense [puppet] - 10https://gerrit.wikimedia.org/r/238778 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin)
[19:55:44] <grrrit-wm>	 (03PS3) 10Alexandros Kosiaris: Update servermon configuration for 0.7 [puppet] - 10https://gerrit.wikimedia.org/r/223347 
[19:56:59] <andrewbogott>	 YuviPanda, mutante, sorry will catch up after meeting
[19:58:25] <grrrit-wm>	 (03PS5) 10BryanDavis: Prepare to enable QuickSurveys in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson)
[19:59:41] <wikibugs>	 6operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#1794518 (10Krinkle) p:5Normal>3High
[20:01:10] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032] Update servermon configuration for 0.7 [puppet] - 10https://gerrit.wikimedia.org/r/223347 (owner: 10Alexandros Kosiaris)
[20:01:12] <grrrit-wm>	 (03PS1) 10Rush: labtest*.codfw.wmnet definitions [dns] - 10https://gerrit.wikimedia.org/r/252025 (https://phabricator.wikimedia.org/T117097) 
[20:01:28] <icinga-wm_>	 PROBLEM - Host labtestservices2001 is DOWN: PING CRITICAL - Packet loss = 100%
[20:04:12] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 031] "I keep reading 'metal' as 'meta'" [dns] - 10https://gerrit.wikimedia.org/r/252025 (https://phabricator.wikimedia.org/T117097) (owner: 10Rush)
[20:04:26] <grrrit-wm>	 (03CR) 10Jhobs: [C: 031] "Config change to enable on testwiki (and enable first survey) to come after this rides the train on Tuesday." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson)
[20:05:33] <grrrit-wm>	 (03CR) 10BryanDavis: [C: 031] "Changed this patch to only setup the extension for l10n and keep it disabled everywhere. Migrates things from being setup only for beta cl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson)
[20:51:28] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Assign salt grains for logstash elastic nodes [puppet] - 10https://gerrit.wikimedia.org/r/252106 
[20:52:30] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Assign salt grains for logstash elastic nodes [puppet] - 10https://gerrit.wikimedia.org/r/252106 
[20:52:52] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for logstash elastic nodes [puppet] - 10https://gerrit.wikimedia.org/r/252106 (owner: 10Muehlenhoff)
[20:56:47] <wikibugs>	 6operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#1794563 (10ori) @Peter, could you perhaps connect Pat Meenan and @BBlack, so we can ask the Chrome team to not drop SPDY before the Nginx situation is resolved? (And perhaps to use their clout to ask Nginx to...
[21:00:04] <jouncebot>	 gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151109T2100). Please do the needful.
[21:00:39] <James_F>	 No deploy today.
[21:00:41] <Reedy>	 lol.
[21:00:52] <subbu>	 no deploys
[21:01:17] <moritzm>	 what's the deal with eventlog2001, it's no longer in site.pp, but e.g. shows up in servermon?
[21:01:17] <Krenair>	 heh
[21:11:22] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Assign salt grains for labs::db::slave and labs::db::master roles [puppet] - 10https://gerrit.wikimedia.org/r/252112 
[21:12:34] <wikibugs>	 6operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#1794583 (10ori) @BBlack: https://grafana.wikimedia.org/dashboard/db/client-connections?panelId=5&fullscreen&from=1446498648837&to=1447103448837 shows support for SPDY ranging from ~62% to ~71%, which is at od...
[21:12:41] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for labs::db::slave and labs::db::master roles [puppet] - 10https://gerrit.wikimedia.org/r/252112 (owner: 10Muehlenhoff)
[21:15:32] <wikibugs>	 6operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#1794586 (10BBlack) @ori: my stats are from clienthellos, so they're per-**connection**.  All of our stats based on X-Connection-Properties are (in some cases, unfortunately) per-**request**, even if several o...
[21:16:20] <ori>	 ottomata: statsv does not appear to work at the moment, but the service is up. are you aware of any issues w/kafka?
[21:17:17] <ottomata>	 hm, ori, moritzm and I enabled ferm on kafka1014 today, and in doing so we did a broker restart
[21:17:40] <moritzm>	 ori, ottomata: let me have a look at the logs
[21:17:42] <ottomata>	 ori, do you remember if you are using kafka-python or python-kafka
[21:17:43] <ottomata>	 ?
[21:17:48] <ottomata>	 sorry
[21:17:48] <ottomata>	 or
[21:17:49] <ottomata>	 pykafka*
[21:18:10] <ottomata>	 we are using pykafka for consumption in eventlogging now, and I had to restart some of the consumers when we did this
[21:18:12] <ottomata>	 and i'm not sure why
[21:18:14] <ori>	 python-pykafka i think
[21:18:30] <moritzm>	 ottomata: I doubt it's related to the ferm rules, nothing got dropped traffic-wise throughout the day
[21:18:35] <ori>	 should i try restarting it, or would you like me to keep it running in its current state?
[21:18:49] <ottomata>	 lemme look real quick, then we can restart
[21:18:52] <ottomata>	 where is it running?
[21:18:58] <ori>	 hafnium
[21:19:19] <ottomata>	 oh jessie now..right! :)
[21:20:35] <wikibugs>	 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1794593 (10RobH) a:5mark>3RobH
[21:21:14] <ottomata>	 ori, am looking at syslog there, and I see Nov  9 14:21:07 hafnium python[16995]: CalledProcessError: Command '['phantomjs', '--ssl-protocol=any', 'asset-check.js', 'https://zh.wikipedia.org/?mainpage']' returned non-zero exit status 1
[21:21:35] <ori>	 that's not statsv
[21:21:55] <ottomata>	 hm
[21:21:57] <ottomata>	 k
[21:22:20] <ottomata>	 hard to separate them out in syslog...
[21:22:27] <ori>	 let me strace it and see what it's actually doing
[21:22:46] <ottomata>	 service statsv status gives a little info
[21:22:47] <ottomata>	 maybe
[21:25:24] <ori>	 waiting on kafka
[21:25:25] <ori>	 it looks like
[21:25:29] <ori>	 reading but nothing is coming
[21:25:44] <ottomata>	 yeah, we saw this once with pykafka too..
[21:25:49] <ottomata>	 there is an issue i think
[21:25:51] <ottomata>	 go ahead and restart
[21:26:22] <ottomata>	 https://github.com/Parsely/pykafka/issues/189
[21:29:28] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] labtest*.codfw.wmnet definitions [dns] - 10https://gerrit.wikimedia.org/r/252025 (https://phabricator.wikimedia.org/T117097) (owner: 10Rush)
[21:30:10] <grrrit-wm>	 (03PS5) 10RobH: Add perf-roots to Graphite role [puppet] - 10https://gerrit.wikimedia.org/r/249966 (owner: 10Ori.livneh)
[21:30:17] <icinga-wm_>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[21:31:24] <grrrit-wm>	 (03CR) 10RobH: [C: 032] "approved in ops meeting." [puppet] - 10https://gerrit.wikimedia.org/r/249966 (owner: 10Ori.livneh)
[21:31:48] <wikibugs>	 10Ops-Access-Requests, 6operations: Requesting access to add perf-roots group to graphite role - https://phabricator.wikimedia.org/T117256#1794612 (10RobH) 5Open>3Resolved a:3RobH This was approved in our operations meeting and is now merged live.  It may take up to 30 minutes for affected hosts to call...
[21:37:53] <grrrit-wm>	 (03PS1) 10Ori.livneh: Add perf-roots to webperf role (as part of I583d9a571) [puppet] - 10https://gerrit.wikimedia.org/r/252114 
[21:39:12] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: "@Subbu, yes Nov 4" [puppet] - 10https://gerrit.wikimedia.org/r/249399 (owner: 10Subramanya Sastry)
[21:39:36] <subbu>	 akosiaris, thanks
[21:40:13] <akosiaris>	 subbu: you're welcome
[21:49:17] <icinga-wm_>	 PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[21:50:12] <ottomata>	 hmm, ori, i think you are using kafka-python?
[21:50:17] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Assign salt grains for db::redis role [puppet] - 10https://gerrit.wikimedia.org/r/252116 
[21:50:24] <ottomata>	 yes, which commits offsets to zk instead of kafka, (old way)
[21:51:25] <ottomata>	 ori, fyi, we recently installed burrow on krypton, and if you use pykafka and commit offsets to kafka, you can get monitoring about consumer lag and status
[21:51:26] <wikibugs>	 6operations: install/setup/deploy neodymium as salt-master in eqiad - https://phabricator.wikimedia.org/T118210#1794623 (10RobH) 3NEW a:3RobH
[21:51:38] <ottomata>	 e.g. curl http://krypton.eqiad.wmnet:8000/v2/kafka/eqiad/consumer
[21:51:39] <wikibugs>	 6operations: install/setup/deploy neodymium as salt-master in eqiad - https://phabricator.wikimedia.org/T118210#1794633 (10RobH)
[21:51:40] <wikibugs>	 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1720246 (10RobH)
[21:51:52] <ottomata>	 and  curl http://krypton.eqiad.wmnet:8000/v2/kafka/eqiad/consumer/eventlogging-00/topic/eventlogging-client-side
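The two curl examples above follow one URL pattern, which can be sketched as a small helper. The function name `burrow_consumer_url` and its parameters are assumptions for illustration; it only reproduces the two URL shapes quoted in the channel and does not contact the server or cover any other endpoints.

```python
from urllib.parse import quote


def burrow_consumer_url(base, cluster, group=None, topic=None):
    """Build the consumer URLs quoted above (hypothetical helper).

    With only a cluster it returns the consumer list URL; with a group and
    a topic it returns the per-topic URL for that consumer group.
    """
    url = "{}/v2/kafka/{}/consumer".format(base.rstrip("/"), quote(cluster, safe=""))
    if group is not None:
        url += "/" + quote(group, safe="")
        if topic is not None:
            url += "/topic/" + quote(topic, safe="")
    return url


if __name__ == "__main__":
    base = "http://krypton.eqiad.wmnet:8000"
    print(burrow_consumer_url(base, "eqiad"))
    print(burrow_consumer_url(base, "eqiad", "eventlogging-00",
                              "eventlogging-client-side"))
```

The printed URLs can then be passed to curl as in the messages above.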
[21:52:12] <bearND>	 !log MobileApps deployed sha1 6c63984
[21:52:16] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:52:26] <wikibugs>	 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1794641 (10RobH)
[21:52:27] <wikibugs>	 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1794640 (10RobH)
[21:52:28] <wikibugs>	 6operations: install/setup/deploy neodymium as salt-master in eqiad - https://phabricator.wikimedia.org/T118210#1794623 (10RobH)
[21:52:30] <wikibugs>	 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1794635 (10RobH) 5Open>3Resolved Well, neodymium has SSDs, but the OS will be placed on the larger SAS disks.  All of the possible allocations are slightly out of spec, this one being we...
[21:54:25] <ori>	 ottomata: so what do i need to do (to make statsv compliant)
[21:54:59] <ottomata>	 ori, just use pykafka instead of kafka-python (pykafka is better) (although I don't know if it would have avoided the problem you just poked me about)
[21:55:27] <ottomata>	 http://pykafka.readthedocs.org/en/latest/
[21:55:54] <ottomata>	 esp. now that hafnium is on jessie, you have a debian python-pykafka package you can use
[21:56:17] <ottomata>	 ori, this is balanced consumer, which you might not need:
[21:56:17] <ottomata>	 https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/server/eventlogging/handlers.py#L471
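A rough sketch of what ottomata is suggesting, using pykafka's simple consumer rather than the balanced consumer linked above. This is not statsv's or EventLogging's actual code: the function name, its parameters, and the timeout value are hypothetical, and the broker list, topic, and consumer group have to be supplied by the caller (as bytes for topic and group in older pykafka releases, which is an assumption about the installed version).

```python
def consume_committing_to_kafka(hosts, topic_name, group, handle):
    """Read messages with pykafka and let it commit offsets to Kafka.

    hosts:      "host:port,host:port" broker list (caller-supplied)
    topic_name: topic to read from
    group:      consumer group under which offsets are committed
    handle:     callback invoked as handle(offset, value) per message
    """
    # Imported here so this module still loads where pykafka is not installed.
    from pykafka import KafkaClient

    client = KafkaClient(hosts=hosts)
    topic = client.topics[topic_name]
    consumer = topic.get_simple_consumer(
        consumer_group=group,
        auto_commit_enable=True,    # periodically commit offsets for the group
        consumer_timeout_ms=5000,   # stop iterating after 5s without messages
    )
    try:
        for message in consumer:
            if message is not None:
                handle(message.offset, message.value)
    finally:
        consumer.stop()  # shut down the consumer's background workers
```

Because the offsets are committed under a consumer group in Kafka rather than in ZooKeeper, a lag monitor such as the Burrow instance mentioned earlier should be able to see that group.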
[21:56:30] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Assign salt grains for db::redis role [puppet] - 10https://gerrit.wikimedia.org/r/252116 
[21:56:43] <grrrit-wm>	 (03PS1) 10RobH: setting neodymium production dns entries [dns] - 10https://gerrit.wikimedia.org/r/252117 
[21:57:24] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for db::redis role [puppet] - 10https://gerrit.wikimedia.org/r/252116 (owner: 10Muehlenhoff)
[21:58:47] <icinga-wm_>	 RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[22:15:13] <grrrit-wm>	 (03PS1) 10RobH: setting install params for neodymium [puppet] - 10https://gerrit.wikimedia.org/r/252119 
[22:16:25] <grrrit-wm>	 (03CR) 10RobH: [C: 032] setting neodymium production dns entries [dns] - 10https://gerrit.wikimedia.org/r/252117 (owner: 10RobH)
[22:16:50] <grrrit-wm>	 (03PS2) 10RobH: setting install params for neodymium [puppet] - 10https://gerrit.wikimedia.org/r/252119 
[22:17:55] <grrrit-wm>	 (03CR) 10RobH: [C: 032] setting install params for neodymium [puppet] - 10https://gerrit.wikimedia.org/r/252119 (owner: 10RobH)
[22:23:10] <apergos>	 woo hoo!
[22:23:37] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds
[22:25:07] <icinga-wm_>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0]
[22:27:18] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 2 below the confidence bounds
[22:30:57] <grrrit-wm>	 (03CR) 10Dzahn: "i don't really have an opinion here. can you add a commit message that explains a bit what is being changed here" [puppet] - 10https://gerrit.wikimedia.org/r/251836 (owner: 10Paladox)
[22:31:07] <icinga-wm_>	 RECOVERY - Host labtestservices2001 is UP: PING OK - Packet loss = 0%, RTA = 34.85 ms
[22:32:20] <grrrit-wm>	 (03CR) 10Dzahn: [C: 031] "i guess.. for consistency" [puppet] - 10https://gerrit.wikimedia.org/r/251714 (https://phabricator.wikimedia.org/T115067) (owner: 10JanZerebecki)
[22:32:36] <icinga-wm_>	 PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Puppet has 1 failures
[22:33:27] <icinga-wm_>	 PROBLEM - RAID on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:33:36] <icinga-wm_>	 PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:34:17] <icinga-wm_>	 PROBLEM - configured eth on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:35:34] <wikibugs>	 6operations: install/setup/deploy neodymium as salt-master in eqiad - https://phabricator.wikimedia.org/T118210#1794706 (10RobH)
[22:35:57] <icinga-wm_>	 RECOVERY - configured eth on analytics1032 is OK: OK - interfaces up
[22:36:57] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 3 below the confidence bounds
[22:37:07] <icinga-wm_>	 RECOVERY - RAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical
[22:37:17] <icinga-wm_>	 RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING
[22:37:38] <wikibugs>	 6operations, 6Discovery, 5codfw-rollout: [EPIC] Set up a CirrusSearch cluster in codfw (Dallas, Texas) - https://phabricator.wikimedia.org/T105703#1794708 (10EBernhardson)
[22:38:02] <wikibugs>	 6operations, 6Discovery, 5codfw-rollout: [EPIC] Set up a CirrusSearch cluster in codfw (Dallas, Texas) - https://phabricator.wikimedia.org/T105703#1449703 (10EBernhardson) This is now up and running with a full copy of the index and all writes going to it. We should do a load test and ensure this meets our e...
[22:38:27] <grrrit-wm>	 (03Abandoned) 10Dzahn: logstash::elasticsearch add base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/248918 (https://phabricator.wikimedia.org/T104964) (owner: 10Dzahn)
[22:38:28] <icinga-wm_>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[22:39:14] <grrrit-wm>	 (03CR) 10Dzahn: "i don't know the current status of this anymore. wish i would" [dns] - 10https://gerrit.wikimedia.org/r/248504 (https://phabricator.wikimedia.org/T599) (owner: 10Dzahn)
[22:41:24] <wikibugs>	 6operations: reclaim rubidium to spares - https://phabricator.wikimedia.org/T118213#1794720 (10RobH) 3NEW a:3RobH
[22:42:23] <wikibugs>	 6operations: reclaim rubidium to spares - https://phabricator.wikimedia.org/T118213#1794732 (10RobH)
[22:42:36] <grrrit-wm>	 (03PS3) 10Dzahn: servermon: add ferm rules for http/https [puppet] - 10https://gerrit.wikimedia.org/r/251552 (https://phabricator.wikimedia.org/T105410) 
[22:42:48] <grrrit-wm>	 (03PS1) 10EBernhardson: Enable CirrusSearch writes to labsearch for all but enwiki and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252120 
[22:42:50] <grrrit-wm>	 (03PS1) 10EBernhardson: Enable CirrusSearch writes to enwiki and dewiki as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252121 
[22:42:54] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] "no-op until base::firewall" [puppet] - 10https://gerrit.wikimedia.org/r/251552 (https://phabricator.wikimedia.org/T105410) (owner: 10Dzahn)
[22:43:15] <grrrit-wm>	 (03PS2) 10Dereckson: Set throttle exception for University of Haifa wiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252010 (https://phabricator.wikimedia.org/T118122) (owner: 10Luke081515)
[22:43:22] <grrrit-wm>	 (03CR) 10Dereckson: [C: 031] Set throttle exception for University of Haifa wiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252010 (https://phabricator.wikimedia.org/T118122) (owner: 10Luke081515)
[22:43:27] <wikibugs>	 6operations: install/setup/deploy neodymium as salt-master in eqiad - https://phabricator.wikimedia.org/T118210#1794735 (10RobH)
[22:43:28] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Enable CirrusSearch writes to enwiki and dewiki as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252121 (owner: 10EBernhardson)
[22:44:33] <grrrit-wm>	 (03Abandoned) 10Dzahn: analytics::mysql::meta, move standard/fw to role [puppet] - 10https://gerrit.wikimedia.org/r/250617 (owner: 10Dzahn)
[22:44:45] <grrrit-wm>	 (03PS2) 10EBernhardson: Enable CirrusSearch writes to enwiki and dewiki as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252121 
[22:46:17] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 3 below the confidence bounds
[22:49:15] <ottomata>	 !log rebooting analytics1032: https://phabricator.wikimedia.org/T118175
[22:49:19] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:51:58] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 3 below the confidence bounds
[22:52:48] <icinga-wm_>	 PROBLEM - Host analytics1032 is DOWN: PING CRITICAL - Packet loss = 100%
[22:57:37] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 4 below the confidence bounds
[22:58:06] <icinga-wm_>	 RECOVERY - Host analytics1032 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[22:58:47] <icinga-wm_>	 RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:01:37] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 3 below the confidence bounds
[23:05:26] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[23:06:16] <icinga-wm_>	 PROBLEM - configured eth on analytics1032 is CRITICAL: Connection refused by host
[23:06:26] <icinga-wm_>	 PROBLEM - SSH on analytics1032 is CRITICAL: Connection refused
[23:06:26] <icinga-wm_>	 PROBLEM - Check size of conntrack table on analytics1032 is CRITICAL: Connection refused by host
[23:06:27] <icinga-wm_>	 PROBLEM - Disk space on Hadoop worker on analytics1032 is CRITICAL: Connection refused by host
[23:06:37] <icinga-wm_>	 PROBLEM - salt-minion processes on analytics1032 is CRITICAL: Connection refused by host
[23:06:47] <icinga-wm_>	 PROBLEM - DPKG on analytics1032 is CRITICAL: Connection refused by host
[23:06:56] <icinga-wm_>	 PROBLEM - Disk space on analytics1032 is CRITICAL: Connection refused by host
[23:06:56] <icinga-wm_>	 PROBLEM - puppet last run on db2030 is CRITICAL: CRITICAL: puppet fail
[23:07:06] <icinga-wm_>	 PROBLEM - dhclient process on analytics1032 is CRITICAL: Connection refused by host
[23:08:07] <icinga-wm_>	 RECOVERY - configured eth on analytics1032 is OK: OK - interfaces up
[23:08:17] <icinga-wm_>	 RECOVERY - SSH on analytics1032 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[23:08:17] <icinga-wm_>	 RECOVERY - Check size of conntrack table on analytics1032 is OK: OK: nf_conntrack is 0 % full
[23:08:17] <icinga-wm_>	 RECOVERY - Disk space on Hadoop worker on analytics1032 is OK: DISK OK
[23:08:36] <icinga-wm_>	 RECOVERY - salt-minion processes on analytics1032 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[23:08:37] <icinga-wm_>	 RECOVERY - DPKG on analytics1032 is OK: All packages OK
[23:08:47] <icinga-wm_>	 RECOVERY - Disk space on analytics1032 is OK: DISK OK
[23:08:56] <icinga-wm_>	 RECOVERY - dhclient process on analytics1032 is OK: PROCS OK: 0 processes with command name dhclient
[23:10:16] <icinga-wm_>	 RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:11:34] <grrrit-wm>	 (03CR) 10EBernhardson: [C: 032] Enable CirrusSearch writes to labsearch for all but enwiki and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252120 (owner: 10EBernhardson)
[23:12:02] <grrrit-wm>	 (03Merged) 10jenkins-bot: Enable CirrusSearch writes to labsearch for all but enwiki and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252120 (owner: 10EBernhardson)
[23:13:47] <logmsgbot>	 !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Enable CirrusSearch writes to labsearch for all but enwiki and dewiki (duration: 00m 35s)
[23:13:51] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:25:56] <ebernhardson>	 AaronSchulz: hey i was looking at the job queue, and i'm not sure anything is moving from the delayed jobs into the main job queue. but i could be wrong
[23:26:11] <ebernhardson>	 AaronSchulz: on the 4 job runners i checked all have 'Caught signal (15)' as the last item in jobchron.log
[23:26:19] <ebernhardson>	 (15 == sigterm == kill)
[23:26:35] <ebernhardson>	 but they all looked to be running the service
[23:27:36] <ebernhardson>	 for an example, i've been watching cirrusSearchIncomingLinkCount on enwiki, which only increases. Also cirrusSearchElasticaWrite which is an uncommonly inserted job, it has been at the same # of delayed jobs for the last 30 minutes although the longest delay it will use is 15 minutes
[23:27:46] <ebernhardson>	 IncomingLinkCount typically uses <1 minute for delay
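The check ebernhardson describes (looking at the last entry of jobchron.log for 'Caught signal (15)') can be sketched in Python as below. Only the 'Caught signal (N)' fragment comes from the log; the rest of the line format and the helper name `last_signal` are assumptions for illustration.

```python
import re
import signal

# Matches the fragment quoted in the channel, e.g. "Caught signal (15)".
CAUGHT = re.compile(r"Caught signal \((\d+)\)")


def last_signal(log_lines):
    """Return the signal name from the last non-empty log line, or None."""
    for line in reversed(log_lines):
        if line.strip():
            match = CAUGHT.search(line)
            if match is None:
                return None  # the last entry is something else
            return signal.Signals(int(match.group(1))).name
    return None  # empty log
```

Signal 15 maps to SIGTERM, the default signal sent by `kill`, which matches the "(15 == sigterm == kill)" note above.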
[23:29:50] <grrrit-wm>	 (03PS1) 10ArielGlenn: start work on cleaning up 'chunks' handling [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/252131 
[23:29:52] <grrrit-wm>	 (03PS1) 10ArielGlenn: dumps: clean up construction of list of possible dump jobs for wiki [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/252132 
[23:29:54] <grrrit-wm>	 (03PS1) 10ArielGlenn: dumps: clean up many comments of methods for dumps jobs [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/252133 
[23:29:56] <grrrit-wm>	 (03PS1) 10ArielGlenn: dumps: clean up docstrings for recompress jobs [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/252134 
[23:31:57] <AaronSchulz>	 ebernhardson: https://grafana-admin.wikimedia.org/dashboard/db/job-queue-rate shows undelaying of jobs (I just added that stat btw)
[23:33:51] <ebernhardson>	 AaronSchulz: so it is doing things, just not since i started watching :)
[23:34:58] <icinga-wm_>	 RECOVERY - puppet last run on db2030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:37:04] <wikibugs>	 6operations: install/setup/deploy neodymium as salt-master in eqiad - https://phabricator.wikimedia.org/T118210#1794854 (10RobH) box has issue, it keeps booting into pxe and not off boot disk.  perhaps jessie installs to ssds while system tries to boot off sata?  (i'll further investigate later)
[23:46:26] <icinga-wm_>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0]
[23:50:07] <icinga-wm_>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0]
[23:53:33] <grrrit-wm>	 (03PS1) 10EBernhardson: Revert "Enable CirrusSearch writes to labsearch for all but enwiki and dewiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252136 
[23:53:42] <grrrit-wm>	 (03CR) 10EBernhardson: [C: 032] Revert "Enable CirrusSearch writes to labsearch for all but enwiki and dewiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252136 (owner: 10EBernhardson)
[23:54:03] <grrrit-wm>	 (03Merged) 10jenkins-bot: Revert "Enable CirrusSearch writes to labsearch for all but enwiki and dewiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252136 (owner: 10EBernhardson)
[23:54:06] <wikibugs>	 6operations, 10OTRS: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1794873 (10Reedy) @akosiaris There's a report from @Rjd0060 that the test instance is slow. Is that likely just a result of the "hardware" it's on/resources assigned to it vs the actual production machine?
[23:55:03] <logmsgbot>	 !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Disable cirrus writes to labsearch, the machine cant take the load and some jobs are timing out (duration: 00m 34s)
[23:55:07] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master