[00:58:36] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: puppet fail
[01:26:17] RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:08:26] PROBLEM - puppet last run on lvs4004 is CRITICAL: CRITICAL: puppet fail
[02:21:42] !log l10nupdate@tin Synchronized php-1.27.0-wmf.5/cache/l10n: l10nupdate for 1.27.0-wmf.5 (duration: 06m 49s)
[02:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:36:08] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[05:18:37] (PS3) MaxSem: Switch www.wikimedia.org to source control [puppet] - https://gerrit.wikimedia.org/r/249009 (https://phabricator.wikimedia.org/T115964)
[05:21:38] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[05:31:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 8 below the confidence bounds
[05:42:56] slave lag?
[05:42:57] Warning: The database has been locked for maintenance, so you will not be able to save your edits right now. You may wish to copy and paste your text into a text file and save it for later.
[05:42:57] The administrator who locked it offered this explanation: The database has been automatically locked while the slave database servers catch up to the master.
[05:43:20] it's fine now
[05:49:27] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[06:30:37] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:40] anyone toy'd with prometheus.io
[06:30:46] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: puppet fail
[06:31:17] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:26] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:47] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:18] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:27] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:36] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:57:16] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:17] RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:46] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:58:08] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[07:28:25] anybody knows what's going on with git deploy/git fat on tin?
I try to deploy and it takes very long time and deployment does not work at the end
[07:28:43] git-fat file is not resolved and I get 0/2 minions completed fetch too
[07:40:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 18 data above and 8 below the confidence bounds
[07:51:38] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[08:00:35] (CR) Muehlenhoff: [C: 1] "Looks good to me. @JanZerebecki ; that's not a problem as long as the identifiers of the ferm rules (like librenms-http) are unique. In fac" [puppet] - https://gerrit.wikimedia.org/r/251550 (https://phabricator.wikimedia.org/T105410) (owner: Dzahn)
[08:10:26] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 2 below the confidence bounds
[08:13:59] (CR) Muehlenhoff: [C: 2 V: 2] Point git buildpackage upstream to HEAD of repo [debs/debdeploy] - https://gerrit.wikimedia.org/r/251506 (owner: Hashar)
[08:15:56] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 2 below the confidence bounds
[08:36:42] operations, Release-Engineering-Team: deployment broken on wdqs1001 - https://phabricator.wikimedia.org/T118148#1792656 (Smalyshev) NEW
[08:38:17] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[08:59:15] (CR) TTO: "@Krenair: Please see comment on createTxtFileSymlinks" (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: TTO)
[08:59:22] (PS9) TTO: Allow import from any Labs/Beta Cluster project to any other [mediawiki-config] - https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583)
[08:59:54] ori: around?
[09:10:14] !log freezing elasticsearch indices in eqiad (test)
[09:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:18:45] !log restarting elastic on elastic1008.eqiad.wmnet (test)
[09:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:19:17] !log err (1007 no 1008): restarting elastic on elastic1007.eqiad.wmnet (test)
[09:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:29:40] !log swift codfw-prod: ms-be2016 / ms-be2018 / ms-be2020 weight 3000
[09:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:41:23] (CR) Muehlenhoff: [C: 2 V: 2] .gitreview file [debs/debdeploy] - https://gerrit.wikimedia.org/r/251505 (owner: Hashar)
[09:48:47] operations, Datasets-General-or-Unknown, HHVM, Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1792716 (ArielGlenn)
[09:50:05] operations, Dumps-Generation: make dumps easy to rerun or clean up - https://phabricator.wikimedia.org/T110876#1792720 (ArielGlenn)
[09:55:05] operations, Dumps-Generation: make dumps easy to rerun or clean up - https://phabricator.wikimedia.org/T110876#1792726 (ArielGlenn) That history rerun task needs to be updated and it doesn't block this work after all; we already rerun history jobs now automatically, as for example the previous month's dum...
[09:55:32] operations, Dumps-Generation: make dumps easy to rerun or clean up - https://phabricator.wikimedia.org/T110876#1792727 (ArielGlenn) Open>Resolved Ah and with that commit this task is now complete.
[09:57:34] operations: Puppet Compiler: Support wildcards, regexps, or 'all hosts' - https://phabricator.wikimedia.org/T114305#1792731 (Joe) p:Triage>Low
[09:57:50] operations, Beta-Cluster-Infrastructure, Blocked-on-RelEng, HHVM, Patch-For-Review: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1792734 (Joe)
[09:57:52] Blocked-on-Operations, operations, HHVM, Patch-For-Review: Reimage mw1152 as a terbium replacement - https://phabricator.wikimedia.org/T116728#1792733 (Joe) Open>Resolved
[10:02:02] <_joe_> !log repooled mw1061
[10:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:02:15] operations, ops-eqiad, Patch-For-Review: mw1061 has a faulty disk, filesystem is read-only - https://phabricator.wikimedia.org/T107849#1792750 (Joe) Open>Resolved
[10:02:16] operations: Investigate idle/depooled eqiad appservers - https://phabricator.wikimedia.org/T116256#1792751 (Joe)
[10:04:13] (PS5) Filippo Giunchedi: restbase: move to systemd unit file [puppet] - https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134)
[10:06:31] (PS1) ArielGlenn: dumps: more refactoring of classes [dumps] (ariel) - https://gerrit.wikimedia.org/r/251931
[10:06:32] (PS1) ArielGlenn: dumps: split the huge jobs module into several manageable ones [dumps] (ariel) - https://gerrit.wikimedia.org/r/251932
[10:13:16] (PS1) Filippo Giunchedi: base: further clarify service_unit ensure [puppet] - https://gerrit.wikimedia.org/r/251933
[10:14:11] (CR) Filippo Giunchedi: restbase: move to systemd unit file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) (owner: Filippo Giunchedi)
[10:15:55] (PS4) ArielGlenn: dumps: move more classes into library, refactor link/feed/etc handling [dumps] (ariel) - https://gerrit.wikimedia.org/r/250921
[10:22:28] operations,
hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1792780 (ArielGlenn) @mark?
[10:24:31] (CR) ArielGlenn: [C: 2 V: 2] dumps: move more classes into library, refactor link/feed/etc handling [dumps] (ariel) - https://gerrit.wikimedia.org/r/250921 (owner: ArielGlenn)
[10:24:47] (PS2) ArielGlenn: dumps: more refactoring of classes [dumps] (ariel) - https://gerrit.wikimedia.org/r/251931
[10:33:31] (CR) Muehlenhoff: "The error in puppet compiler is caused by a merge conflict, since this patch depends on the now abandoned" [puppet] - https://gerrit.wikimedia.org/r/250078 (owner: Dzahn)
[10:34:59] (CR) ArielGlenn: [C: 2 V: 2] dumps: more refactoring of classes [dumps] (ariel) - https://gerrit.wikimedia.org/r/251931 (owner: ArielGlenn)
[10:35:11] (PS2) ArielGlenn: dumps: split the huge jobs module into several manageable ones [dumps] (ariel) - https://gerrit.wikimedia.org/r/251932
[10:36:01] mobrovac: heads up re: cassandra multiple instances in production, ping me when you are about
[10:53:07] !log resuming writes to elasticsearch indices in eqiad (test)
[10:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:54:29] (PS1) Hashar: Jenkins: sync cli-shutdown.groovy from upstream [puppet] - https://gerrit.wikimedia.org/r/251935 (https://phabricator.wikimedia.org/T118064)
[11:00:31] operations, Gitblit: Accessing raw link on git.wikimedia.org causes "Error Sorry, the repository mediawiki does not have a extensions branch!" - https://phabricator.wikimedia.org/T118156#1792867 (saper) NEW
[11:01:51] operations, Gitblit: Accessing raw link on git.wikimedia.org causes "Error Sorry, the repository mediawiki does not have a extensions branch!" - https://phabricator.wikimedia.org/T118156#1792880 (saper) http://git.wikimedia.org/raw/mediawiki%2Fextensions%2FSemanticMediaWiki.git/master/COPYING has the sam...
[11:09:20] !log Upgrading Jenkins from LTS 1.609.3 to LTS 1.625.1 https://phabricator.wikimedia.org/T118157
[11:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:14:08] (PS1) Muehlenhoff: Enable ferm on kafka1014 [puppet] - https://gerrit.wikimedia.org/r/251936
[11:16:00] !log Jenkins back and apparently happy
[11:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:17:30] (PS2) Muehlenhoff: Enable ferm on kafka1014 [puppet] - https://gerrit.wikimedia.org/r/251936
[11:18:07] operations, Continuous-Integration-Infrastructure, Jenkins, WorkType-Maintenance: Please refresh Jenkins package on apt.wikimedia.org to 1.625.1 - https://phabricator.wikimedia.org/T118158#1792900 (hashar) NEW
[11:23:29] operations, Database: dbtree fails to render correctly on a new server (mw1152) both with zend php and hhvm - https://phabricator.wikimedia.org/T118159#1792947 (Joe) NEW
[11:23:48] <_joe_> jynus: I marked this task "Database" ^^ but feel free to strip the label
[11:26:58] _joe_, that is an answer from memcached
[11:28:08] oh, no sorry
[11:28:13] I misread the line
[11:28:23] it is mysql_fetch_array
[11:29:40] <_joe_> yup
[11:36:13] (PS7) Muehlenhoff: Create a define to register extra LDAP schemas [puppet] - https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299)
[11:37:07] (CR) Muehlenhoff: [C: 2 V: 2] Create a define to register extra LDAP schemas [puppet] - https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) (owner: Muehlenhoff)
[11:39:22] operations, Deployment-Systems, Release-Engineering-Team: deployment broken on wdqs1001 - https://phabricator.wikimedia.org/T118148#1793022 (hashar)
[11:42:21] (PS5) Muehlenhoff: openldap: Allow configurable ACLs [puppet] - https://gerrit.wikimedia.org/r/251272 (https://phabricator.wikimedia.org/T101299)
[11:44:42] (PS1) Jcrespo: Depooling
again db1060 for more maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/251939
[11:44:58] (PS2) Jcrespo: Depooling again db1060 for more maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/251939
[11:51:37] (CR) Jcrespo: [C: 2] Depooling again db1060 for more maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/251939 (owner: Jcrespo)
[11:51:59] (Merged) jenkins-bot: Depooling again db1060 for more maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/251939 (owner: Jcrespo)
[12:00:11] godog: pong
[12:01:25] mobrovac: ack! looks like we're good to go? I'll prepare the code reviews
[12:02:18] godog: let me check something first
[12:02:36] mobrovac: yeah it'll take some time anyway
[12:02:57] (CR) Muehlenhoff: "The debdeploy/Hiera part of the change works fine, but puppet compiler also shows changes to the sshd config?" [puppet] - https://gerrit.wikimedia.org/r/250659 (owner: Muehlenhoff)
[12:10:41] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depooling db1060 again (duration: 01m 13s)
[12:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:14:41] operations, Wikibase-Quality, Wikidata, Database: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1793064 (Joe) NEW
[12:14:45] <_joe_> aude: ^^
[12:14:56] <_joe_> please add tags that I might have missed
[12:15:15] operations, Gitblit: Accessing raw link on git.wikimedia.org causes "Error Sorry, the repository mediawiki does not have a extensions branch!" - https://phabricator.wikimedia.org/T118156#1793074 (Aklapper) Same as T117459 ? Not sure why "operations" was added to this task?
[12:23:03] (PS1) Filippo Giunchedi: cassandra: additional instances for eqiad/codfw production [dns] - https://gerrit.wikimedia.org/r/251941 (https://phabricator.wikimedia.org/T95250)
[12:23:14] (PS2) Muehlenhoff: Reorg server groups for debdeploy [puppet] - https://gerrit.wikimedia.org/r/250932
[12:25:09] operations, Database: dbtree fails to render correctly on a new server (mw1152) both with zend php and hhvm - https://phabricator.wikimedia.org/T118159#1793083 (Krenair) Did you fix this? It looks like it's working to me.
[12:25:24] <_joe_> Krenair: yes we fixed it
[12:25:36] <_joe_> but jaime did, and he's performing an interview
[12:25:58] (PS2) Filippo Giunchedi: cassandra: additional instances for eqiad/codfw production [dns] - https://gerrit.wikimedia.org/r/251941 (https://phabricator.wikimedia.org/T95250)
[12:26:04] (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: additional instances for eqiad/codfw production [dns] - https://gerrit.wikimedia.org/r/251941 (https://phabricator.wikimedia.org/T95250) (owner: Filippo Giunchedi)
[12:27:11] (CR) Muehlenhoff: [C: 2 V: 2] Reorg server groups for debdeploy [puppet] - https://gerrit.wikimedia.org/r/250932 (owner: Muehlenhoff)
[12:28:00] ok
[12:28:41] godog: need to do a deploy before you decommission one server in codfw so that DTCS doesn't get reverted when restbase boots there
[12:29:23] <_joe_> every time I read these things ^^, it's mildly terrifying, you know that?
[12:30:06] i know _joe_
[12:32:40] godog: wait, you want to add 3x instances in both eqiad and codfw at the same time or am i reading https://gerrit.wikimedia.org/r/251941 wrong?
[12:33:25] mobrovac: nah that's just dns
[12:33:33] k
[12:35:06] PROBLEM - puppet last run on elastic1014 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:39:00] (PS2) Filippo Giunchedi: base: further clarify service_unit ensure [puppet] - https://gerrit.wikimedia.org/r/251933
[12:39:07] (CR) Filippo Giunchedi: [C: 2 V: 2] base: further clarify service_unit ensure [puppet] - https://gerrit.wikimedia.org/r/251933 (owner: Filippo Giunchedi)
[12:39:52] operations, MediaWiki-extensions-WikibaseRepository, Wikidata, Database: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1793130 (Addshore)
[12:42:10] operations, MediaWiki-extensions-WikibaseRepository, Wikidata, Database: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1793142 (Addshore) Related: - T111353 - T108944 - T48643 - T70381 - T70382
[12:48:30] operations, ops-esams: Power cr2-esams PEM 2/PEM 3 - https://phabricator.wikimedia.org/T118166#1793163 (faidon) NEW a:mark
[12:52:40] (PS2) Alexandros Kosiaris: Jenkins: sync cli-shutdown.groovy from upstream [puppet] - https://gerrit.wikimedia.org/r/251935 (https://phabricator.wikimedia.org/T118064) (owner: Hashar)
[12:52:48] (CR) Alexandros Kosiaris: [C: 2 V: 2] Jenkins: sync cli-shutdown.groovy from upstream [puppet] - https://gerrit.wikimedia.org/r/251935 (https://phabricator.wikimedia.org/T118064) (owner: Hashar)
[12:53:21] akosiaris: thanks!
will run puppet and get jenkins restarted :-}
[12:53:56] hashar: yw
[12:54:25] (CR) Gilles: [C: -1] swift: monitor mediawiki originals upload rate (1 comment) [puppet] - https://gerrit.wikimedia.org/r/251526 (https://phabricator.wikimedia.org/T92322) (owner: Filippo Giunchedi)
[12:55:47] (CR) Alexandros Kosiaris: [C: 1] openldap: Allow configurable ACLs [puppet] - https://gerrit.wikimedia.org/r/251272 (https://phabricator.wikimedia.org/T101299) (owner: Muehlenhoff)
[12:55:57] PROBLEM - puppet last run on mw1132 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:56:38] !log restarting Jenkins to refresh the cli-shutdown.groovy script -- https://gerrit.wikimedia.org/r/251935 (https://phabricator.wikimedia.org/T118064)
[12:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:57:58] (PS1) Muehlenhoff: More tweaks for server groups [puppet] - https://gerrit.wikimedia.org/r/251943
[12:58:13] (PS2) Muehlenhoff: More tweaks for server groups [puppet] - https://gerrit.wikimedia.org/r/251943
[12:59:06] !log restbase start deploying ae2a44f
[12:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:59:32] (PS3) Alexandros Kosiaris: exim: Add and use $::other_site to provide LDAP fallback [puppet] - https://gerrit.wikimedia.org/r/249868 (https://phabricator.wikimedia.org/T82662)
[12:59:34] (PS2) Alexandros Kosiaris: exim: removal of non-DC aware ldap-mirror CNAME [puppet] - https://gerrit.wikimedia.org/r/250438
[12:59:55] (CR) Muehlenhoff: [C: 2 V: 2] More tweaks for server groups [puppet] - https://gerrit.wikimedia.org/r/251943 (owner: Muehlenhoff)
[13:01:27] operations, Beta-Cluster-Infrastructure: Can't apply ::role::logging::mediawiki on a trusty host - https://phabricator.wikimedia.org/T98627#1793203 (hashar) Open>Invalid a:hashar deployment-fluorine has been rebuild as a Precise host.
No point in keeping this task around, whenever one migrates it...
[13:01:36] RECOVERY - puppet last run on elastic1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:06:38] (PS1) Filippo Giunchedi: install_server: cassandra multi instance in eqiad/codfw [puppet] - https://gerrit.wikimedia.org/r/251944
[13:06:40] (PS1) Filippo Giunchedi: cassandra: add restbase[12]00[12] to seeds [puppet] - https://gerrit.wikimedia.org/r/251945
[13:08:07] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [5000000.0]
[13:10:37] (PS1) Muehlenhoff: Fix grain name [puppet] - https://gerrit.wikimedia.org/r/251946
[13:10:47] (PS2) Filippo Giunchedi: install_server: cassandra multi instance in eqiad/codfw [puppet] - https://gerrit.wikimedia.org/r/251944
[13:10:56] (CR) Filippo Giunchedi: [C: 2 V: 2] install_server: cassandra multi instance in eqiad/codfw [puppet] - https://gerrit.wikimedia.org/r/251944 (owner: Filippo Giunchedi)
[13:11:15] (PS2) Filippo Giunchedi: cassandra: add restbase[12]00[12] to seeds [puppet] - https://gerrit.wikimedia.org/r/251945
[13:11:23] (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: add restbase[12]00[12] to seeds [puppet] - https://gerrit.wikimedia.org/r/251945 (owner: Filippo Giunchedi)
[13:11:33] operations, Beta-Cluster-Infrastructure, Performance: Need a way to simulate replication lag to test replag issues - https://phabricator.wikimedia.org/T40945#1793233 (hashar) Open>stalled a:Nikerabbit>None
[13:13:24] !log restbase finished deploying ae2a44f
[13:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:13:47] godog: deploy done
[13:14:40] (PS2) Muehlenhoff: Fix grain name [puppet] - https://gerrit.wikimedia.org/r/251946
[13:14:57] (CR) Muehlenhoff: [C: 2 V: 2] Fix grain name [puppet] - https://gerrit.wikimedia.org/r/251946 (owner:
Muehlenhoff)
[13:15:53] (CR) ArielGlenn: [C: 2 V: 2] dumps: split the huge jobs module into several manageable ones [dumps] (ariel) - https://gerrit.wikimedia.org/r/251932 (owner: ArielGlenn)
[13:16:26] mobrovac: ack! I'll decommission restbase2001
[13:16:48] !log nodetool decomission restbase2001
[13:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:21:17] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[13:22:26] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[13:23:49] mobrovac: it is decommissioning, I'll go for lunch meanwhile
[13:25:16] moritzm: merging your change
[13:27:55] thanks, sorry
[13:28:17] (CR) Faidon Liambotis: [C: 1] "LGTM for whenever you feel it's time to merge." [puppet] - https://gerrit.wikimedia.org/r/250438 (owner: Alexandros Kosiaris)
[13:28:34] np
[13:32:57] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.066 second response time
[13:32:57] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 70590 bytes in 0.314 second response time
[13:38:58] (CR) BBlack: [C: -1] "Well IE8/XP wouldn't be the only case, there are those other minority clients that we're unlikely to ever see on misc, or would see extrem" [puppet] - https://gerrit.wikimedia.org/r/251704 (owner: BBlack)
[13:39:34] (Abandoned) BBlack: set cache_misc to "mid" ciphersuite [puppet] - https://gerrit.wikimedia.org/r/251704 (owner: BBlack)
[13:39:57] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:41:21] operations, Continuous-Integration-Scaling, Nodepool: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#1793341 (hashar) a:hashar
[13:49:57] operations, hardware-requests: Detail codfw snapshot/dataset
requirements - https://phabricator.wikimedia.org/T118173#1793344 (RobH) NEW a:RobH
[13:51:55] operations, hardware-requests: Detail codfw snapshot/dataset requirements - https://phabricator.wikimedia.org/T118173#1793353 (RobH) Below is a copy from my old email entry in the RT ticket: We need to replace the snapshot and dataset infrastructure from Tampa for CODFW. All the hardware was out of warr...
[13:53:16] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[13:53:50] operations, hardware-requests: Detail codfw snapshot/dataset requirements - https://phabricator.wikimedia.org/T118173#1793354 (RobH) a:RobH>ArielGlenn Assigning to Ariel for overall input and detailing of the requirements for a snapshot/dataset cluster host in codfw. Keep in mind we'll have to orde...
[14:04:48] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:05:29] (PS1) Faidon Liambotis: Kill CirrusSearch-slow-queries alert [puppet] - https://gerrit.wikimedia.org/r/251948 (https://phabricator.wikimedia.org/T84163)
[14:05:50] (CR) Faidon Liambotis: [C: 2] Kill CirrusSearch-slow-queries alert [puppet] - https://gerrit.wikimedia.org/r/251948 (https://phabricator.wikimedia.org/T84163) (owner: Faidon Liambotis)
[14:06:43] (PS6) Muehlenhoff: openldap: Allow configurable ACLs [puppet] - https://gerrit.wikimedia.org/r/251272 (https://phabricator.wikimedia.org/T101299)
[14:09:17] (PS1) Muehlenhoff: Assign salt grains for cp* hosts [puppet] - https://gerrit.wikimedia.org/r/251949
[14:10:17] (PS1) Giuseppe Lavagetto: dbtree: move to its own directory [puppet] - https://gerrit.wikimedia.org/r/251950
[14:11:20] (PS2) Giuseppe Lavagetto: dbtree: move to its own directory [puppet] - https://gerrit.wikimedia.org/r/251950
[14:14:02] (CR) Giuseppe Lavagetto: [C: 2] dbtree: move to its own directory [puppet] -
https://gerrit.wikimedia.org/r/251950 (owner: Giuseppe Lavagetto)
[14:15:02] (PS1) Muehlenhoff: zookeeper: Don't expose the JMX port in ferm rules [puppet] - https://gerrit.wikimedia.org/r/251951
[14:15:22] (CR) Ottomata: [C: 1] Enable ferm on kafka1014 [puppet] - https://gerrit.wikimedia.org/r/251936 (owner: Muehlenhoff)
[14:15:29] moritzm: let's do it!
[14:17:24] ottomata: ok, I'll rebase, merge and puppet-run it
[14:17:52] operations, Phabricator: migrate RT main-announce into phabricator - https://phabricator.wikimedia.org/T118176#1793387 (RobH) NEW
[14:17:58] (PS3) Muehlenhoff: Enable ferm on kafka1014 [puppet] - https://gerrit.wikimedia.org/r/251936
[14:18:12] operations, Phabricator: migrate RT main-announce into phabricator - https://phabricator.wikimedia.org/T118176#1793394 (RobH) a:RobH I'll detail out how mail is routed and how we triage the requests shortly.
[14:18:49] (PS2) Andrew Bogott: Assign IPs for public labtest hosts. [dns] - https://gerrit.wikimedia.org/r/251655
[14:19:31] (CR) Muehlenhoff: [C: 2 V: 2] Enable ferm on kafka1014 [puppet] - https://gerrit.wikimedia.org/r/251936 (owner: Muehlenhoff)
[14:19:33] (CR) Andrew Bogott: [C: 2] Assign IPs for public labtest hosts. [dns] - https://gerrit.wikimedia.org/r/251655 (owner: Andrew Bogott)
[14:20:01] moritzm: i'm watching kafka logs on a couple of brokers
[14:20:05] dooo it
[14:20:06] :)
[14:20:22] puppet run is ongoing, will add some logging rules once done
[14:20:29] operations, hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1793396 (mark) >>! In T115288#1754857, @RobH wrote: > Chatted with Ariel in IRC. > > Going to go with one of the: > > Dell PowerEdge R420, Dual Intel Xeon E5-2440, 32GB Memory, Dual 300G...
[14:20:30] (CR) Ottomata: [C: 1] zookeeper: Don't expose the JMX port in ferm rules [puppet] - https://gerrit.wikimedia.org/r/251951 (owner: Muehlenhoff)
[14:21:28] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:22:36] ottomata: up and enabled, nothing in the logs so far
[14:22:57] (PS1) Giuseppe Lavagetto: terbium: do not include noc [puppet] - https://gerrit.wikimedia.org/r/251952
[14:23:18] operations, ops-eqiad: remove patch cable for cr1-eqiad:xe-4/2/1 ID 3482, circuit ID ETYX/084858//ZYO - https://phabricator.wikimedia.org/T118177#1793400 (RobH) NEW a:Cmjohnson
[14:23:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[14:24:25] (PS1) Andrew Bogott: Nova scheduler changes: [puppet] - https://gerrit.wikimedia.org/r/251954
[14:25:07] (PS1) Faidon Liambotis: Remove subnet for ulsfo-eqiad Giglinx link [dns] - https://gerrit.wikimedia.org/r/251955 (https://phabricator.wikimedia.org/T118170)
[14:25:21] (CR) Giuseppe Lavagetto: [C: 2] terbium: do not include noc [puppet] - https://gerrit.wikimedia.org/r/251952 (owner: Giuseppe Lavagetto)
[14:27:06] hmm, i see more broken pipe errors on ka14 than i would expect, i think...
[14:27:18] yeah not looking good moritzm
[14:27:23] hold
[14:28:19] ottomata: I suppose that's some fallout of the rules kicking into live operation, there's no dropped traffic on 1014 per se
[14:28:45] no?
[14:29:04] trying to see what's going on though, replica ISRs have shrunk
[14:29:15] operations, Continuous-Integration-Scaling, Nodepool: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#1793414 (hashar) I tried but I eventually give up. The toolchain is just too complicated for me to figure out. So at first the /debian/ source...
[14:32:39] moritzm: this enabled base ferm on kafka1014, right?
[14:33:05] moritzm: i think we should roll back
[14:33:15] ottomata: ok, we can do that
[14:33:16] (PS2) Andrew Bogott: Nova scheduler changes: [puppet] - https://gerrit.wikimedia.org/r/251954
[14:34:20] yeah, and consumers seem to have stopped working too from ka14
[14:34:31] Could not receive response to request ... Kafka @ kafka1014.eqiad.wmnet:9092 went away
[14:34:42] data = self._sock.recv(min(bytes_left, 4096))
[14:34:42] timeout: timed out
[14:35:24] it's strange though
[14:35:44] kafka1012 seems not able to replicate from kafka1014
[14:35:48] but other brokers seem to be doing so fine.
[14:36:10] <_joe_> !log reducing /tmp size on copper by shrinking the logical volume
[14:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:36:23] (PS1) Muehlenhoff: Revert "Enable ferm on kafka1014" [puppet] - https://gerrit.wikimedia.org/r/251962
[14:36:41] but, the consumers on eventlog1001 can't consume from kafka1014?
[14:36:41] hmmMmm
[14:36:56] really not sure what is going on here, am a little worried, there's some weird stuff going on with offset requests being out of range
[14:37:17] (CR) Andrew Bogott: [C: 2] Nova scheduler changes: [puppet] - https://gerrit.wikimedia.org/r/251954 (owner: Andrew Bogott)
[14:37:41] 1012 is 10.64.5.12 and that is allowed in the rules for 9092
[14:38:26] moritzm: ja, but it has dropped out of in sync replica list for partitions where 1014 is the leader
[14:38:32] my guess is that during the rules setup 1014 was briefly unavailable and the others are acting on that (but we can also revert, see 251962 )
[14:39:00] if that was the case it would recover quickly, no?
[14:39:08] also, eventlog1001 had a hiccup at the very least
[14:39:23] ok moritzm, hold for one sec before merging, going to restart eventlogging and see what happens in logs there, maybe it can consume...
[14:40:04] !log restarting eventlogging to see if it is ok after enabling firewall rules on kafka1014
[14:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:40:14] ok
[14:43:12] ok moritzm, let's revert
[14:43:19] i dunno what is happening, but it doesn't look normal
[14:43:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds
[14:44:44] ottomata: ok, going ahead
[14:45:11] (PS2) Muehlenhoff: Revert "Enable ferm on kafka1014" [puppet] - https://gerrit.wikimedia.org/r/251962
[14:45:35] ottomata: I stopped ferm manually and will merge next
[14:45:50] ok
[14:45:52] (CR) Muehlenhoff: [C: 2 V: 2] Revert "Enable ferm on kafka1014" [puppet] - https://gerrit.wikimedia.org/r/251962 (owner: Muehlenhoff)
[14:46:47] moritzm: why would there be a hiccup at all? does ferm enable the full firewall before allowing connections to the configured ports?
[14:46:48] Puppet, Patch-For-Review, Ruby: Fix easy problems reported by RuboCop in operations/puppet - https://phabricator.wikimedia.org/T112651#1793448 (zeljkofilipin)
[14:47:09] Puppet, Patch-For-Review, Ruby: Fix easy problems reported by RuboCop in operations/puppet - https://phabricator.wikimedia.org/T112651#1793450 (zeljkofilipin) Open>stalled
[14:48:20] !log swift codfw-prod: ms-be2017 / ms-be2019 / ms-be2021 weight 1000
[14:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:48:48] moritzm: i wonder if we should try stopping kafka on kafka1014
[14:49:01] to allow consumers and producers and replicas to just CHILL
[14:49:06] then enable ferm.
[14:49:09] then start kafka back up.
[14:49:10] 6operations, 6Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#1793452 (10scfc) [14:49:51] ottomata: that's caused by an interaction between ferm and puppet, since the rules which allow the granted traffic are added in steps, once the rules files in /etc/ferm/conf.d are fully generated, the rules are put into effect immediately [14:50:30] ottomata: yeah, stopping kafka before making the flip would indeed probably make sense here [14:52:22] ok, moritzm, the more I look, the more I think that doing that will be ok... I do see some really strange offset requests, but afaict everything was working fine. I think the replicas I saw out of sync were not really (https://issues.apache.org/jira/browse/KAFKA-1367) (I forgot about this confusing issue). [14:52:24] PROBLEM - puppet last run on mw2030 is CRITICAL: CRITICAL: Puppet has 1 failures [14:54:02] seems so (after all there were no failed connection attempts to 1014 at all), shall we make another try with a stopped kafka broker before enabling it? [14:55:23] ja think so, i'm watching offsets for an eventlogging consumer now too, which will help me feel a little better when i see weird offset requests. [14:55:31] moritzm: i'll take broker on 1014 down now... [14:56:34] !log stopping kafka broker on kafka1014 [14:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:56:52] hmm, moritzm, puppet will restart kafka [14:56:58] can you apply the ferm changes manually? 
[14:57:11] 6operations, 10ops-eqiad, 10netops: test new sfp-t - https://phabricator.wikimedia.org/T118178#1793456 (10RobH) 3NEW a:3Cmjohnson [14:58:46] ottomata: for 1014 I can simply restart it (since puppet created all the rules already) [14:59:54] ok moritzm, kafka is stopped on 1014 [14:59:56] go ahead [15:00:17] it's re-enabled [15:00:44] PROBLEM - puppet last run on mw1060 is CRITICAL: CRITICAL: Puppet has 1 failures [15:00:44] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [10.0] [15:02:09] ok [15:02:12] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [10.0] [15:02:18] !starting kafka broker on kafka1014 [15:02:28] !log starting kafka broker on kafka1014 [15:02:31] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [10.0] [15:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:02:36] I cracked the code for recentchanges efficienty, but I need someone to dump my mind and reach to an implementation [15:03:05] maybe I should way for someone at perf [15:03:08] *Wait [15:05:56] 6operations, 10Traffic: Planning for phasing out non-Forward-Secret TLS ciphers - https://phabricator.wikimedia.org/T118181#1793501 (10BBlack) 3NEW [15:06:52] !log running kafka preferred-replica-election [15:06:52] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [10.0] [15:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:08:22] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [10.0] [15:09:04] 6operations, 6Discovery: Fix CirrusSearch monitoring - https://phabricator.wikimedia.org/T84163#1793514 
(10chasemp) [15:09:32] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [10.0] [15:09:34] (03PS1) 10Muehlenhoff: Enable ferm on kafka1014 again [puppet] - 10https://gerrit.wikimedia.org/r/251964 [15:09:35] ja, k moritzm, things are looking ok [15:09:42] RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [15:09:47] (ignore those underreplicate partition alerts, they will resolve shortly.) [15:11:24] ottomata: ok, merging so that the status in puppet and on the system is in sync again [15:11:33] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [10.0] [15:11:51] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on kafka1014 again [puppet] - 10https://gerrit.wikimedia.org/r/251964 (owner: 10Muehlenhoff) [15:11:57] 6operations, 10ops-eqiad, 7Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1793526 (10fgiunchedi) I believe we went with 1G everywhere @cmjohnson? in any case looks like this is complete [15:12:16] 6operations, 6Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#1793527 (10RobH) We'll need to modify our workflow. Right now in RT, maint-annoucements come in multiple times for a single event. We'll typically get an initial notification of maintenance, then... [15:12:42] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1013 is OK: OK: Less than 1.00% above the threshold [1.0] [15:13:08] ottomata: it's merged. ok to re-enable the puppet agent? 
[15:13:29] 6operations, 7Swift: add ms-be1019 / 1020 / 1021 to swift - https://phabricator.wikimedia.org/T118183#1793530 (10fgiunchedi) 3NEW a:3fgiunchedi [15:13:31] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 1.00% above the threshold [1.0] [15:13:47] 6operations, 6Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#1793539 (10RobH) We'll need to modify our workflow. Right now in RT, maint-annoucements come in multiple times for a single event. We'll typically get an initial notification of maintenance, then... [15:13:52] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [15:14:11] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1018 is OK: OK: Less than 1.00% above the threshold [1.0] [15:15:01] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 1.00% above the threshold [1.0] [15:15:04] moritzm: yes [15:15:07] do please [15:15:32] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1012 is OK: OK: Less than 1.00% above the threshold [1.0] [15:15:54] 6operations, 6Project-Creators: create #ops-eqdfw & #ops-eqord projects - https://phabricator.wikimedia.org/T117585#1793549 (10RobH) I was simply being paranoid. I'll go ahead and create these later today and will update that page, thanks for linking it @krenair. 
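The sequence that worked for kafka1014 above (stop the broker, bring ferm up, start the broker, run a preferred-replica election, then merge and re-enable the puppet agent) can be written down as an ordered checklist before repeating it on other brokers. This is a minimal Python sketch, not the commands that were actually run: `ferm_rollout_plan` and the step wording are hypothetical, it only prints a plan and touches no host, and the initial "disable the puppet agent" step is inferred from the "puppet will restart kafka" remark and the later question about re-enabling the agent.

```python
def ferm_rollout_plan(broker: str) -> list[str]:
    """Return the ordered, human-readable steps for one broker (dry run only)."""
    return [
        # Assumption: the agent was disabled first, since puppet would restart kafka.
        f"disable the puppet agent on {broker}",
        f"stop the kafka broker on {broker}",
        f"enable ferm on {broker} and wait until all rules are loaded",
        f"start the kafka broker on {broker}",
        "run a kafka preferred-replica election",
        "wait for the under-replicated-partition alerts to recover",
        f"merge the puppet change and re-enable the puppet agent on {broker}",
    ]


if __name__ == "__main__":
    # Print a numbered checklist for the broker discussed in the log.
    for number, step in enumerate(ferm_rollout_plan("kafka1014"), start=1):
        print(f"{number}. {step}")
```

Keeping the broker stopped while the rules load means consumers, producers and replicas never see the half-configured firewall that moritzm describes.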
[15:17:52] RECOVERY - puppet last run on mw2030 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [15:18:04] 7Puppet, 10Continuous-Integration-Config, 7Ruby: Move RuboCop job from experimental pipeline to the usual pipelines for operations/puppet - https://phabricator.wikimedia.org/T110019#1793567 (10zeljkofilipin) [15:19:48] ottomata: darian was asking the other day too, I'm not sure what happened with it -- is there a replacement for http://stats.wikimedia.org/wikimedia/squids/SquidReportClients.htm ? [15:21:18] paravoid: not that I know of, but you could get it out of https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly using user_agent_map [15:21:29] and/or the new pageview APi in AQS (which will be announced soon? ? maybe?) [15:21:50] paravoid: there is a lot of discussion in analytics goals now about replacing many parts of stats.wm.o, but i don't know details [15:22:02] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection timed out [15:23:01] moritzm: all looks fine [15:23:06] shall we proceed with other brokers? 
[15:23:19] oh, sorry, just saw your other message [15:23:31] Coren/YuviPanda/andrewbogott/chasemp: ^ toolserver.org alert [15:24:51] seems really down too, no clue where this lives but I'll get into it [15:24:56] (03PS2) 10Muehlenhoff: zookeeper: Don't expose the JMX port in ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/251951 [15:25:30] (03CR) 10Muehlenhoff: [C: 032 V: 032] zookeeper: Don't expose the JMX port in ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/251951 (owner: 10Muehlenhoff) [15:26:11] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2016-06-30 17:56:02 +0000 (expires in 234 days) [15:27:20] <_joe_> chasemp: that would be on the tools-proxies first of all [15:27:25] <_joe_> as this was an ssl failure [15:27:41] RECOVERY - puppet last run on mw1060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:34:07] (03PS1) 10Filippo Giunchedi: mediawiki: rename jobqueue.job-pop graphite alarm [puppet] - 10https://gerrit.wikimedia.org/r/251970 [15:35:01] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] mediawiki: rename jobqueue.job-pop graphite alarm [puppet] - 10https://gerrit.wikimedia.org/r/251970 (owner: 10Filippo Giunchedi) [15:42:47] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, and 2 others: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1793660 (10chasemp) [15:42:48] 6operations, 6Labs, 10Labs-Infrastructure, 10netops, and 3 others: Allocate subnet for labs test cluster instances - https://phabricator.wikimedia.org/T115492#1793659 (10chasemp) 5Open>3Resolved [15:45:52] !log reimage graphite1002 [15:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:48:31] 6operations, 6Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#1793700 (10RobH) I've CC'd in @Dzahn. 
Daniel regularly patrols maint-announce and updates the tracking calendar, so I want to ensure we have him review our potential plan(s). [15:48:54] PROBLEM - Host graphite1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:52:15] RECOVERY - Host graphite1002 is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms [15:53:05] graphite is in mirror mode, right? [15:53:38] jynus: mirror to codfw, yes [15:54:13] (03CR) 10Subramanya Sastry: "This has been deployed already right?" [puppet] - 10https://gerrit.wikimedia.org/r/249399 (owner: 10Subramanya Sastry) [15:54:14] so graphite1001 and graphite1002 are sharded? [15:55:07] no, graphite1002 is being used for tests ATM, hence the light-hearthed reimage [15:55:15] ok [15:55:35] jynus: actually the machine I've used to test linux hybrid ssd/disk caching [15:56:24] PROBLEM - configured eth on graphite1002 is CRITICAL: Connection refused by host [15:56:44] PROBLEM - dhclient process on graphite1002 is CRITICAL: Connection refused by host [15:56:48] half-hearted good morning [15:56:55] PROBLEM - puppet last run on graphite1002 is CRITICAL: Connection refused by host [15:57:14] PROBLEM - salt-minion processes on graphite1002 is CRITICAL: Connection refused by host [15:57:14] PROBLEM - Disk space on graphite1002 is CRITICAL: Connection refused by host [15:57:28] PROBLEM - RAID on graphite1002 is CRITICAL: Connection refused by host [15:57:29] (03CR) 10Subramanya Sastry: "I ask because of https://www.mediawiki.org/wiki/Parsoid/Deployments#Monday.2C_Nov_9.2C_2015_around_1:15_pm_PT:_b869b084_to_be_deployed" [puppet] - 10https://gerrit.wikimedia.org/r/249399 (owner: 10Subramanya Sastry) [15:57:45] PROBLEM - DPKG on graphite1002 is CRITICAL: Connection refused by host [15:58:56] 10Ops-Access-Requests, 6operations: Requesting access to add perf-roots group to graphite role - https://phabricator.wikimedia.org/T117256#1793735 (10RobH) +1 to @chasemp's proposed rename of the group (no objection to request.) 
[16:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151109T1600). [16:01:42] Here for SWAT [16:02:21] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [16:02:27] hi [16:03:59] Krinkle, James_F: ping? [16:04:05] Is bgerstle not here? [16:04:10] bgerstile* [16:04:32] 6operations, 10hardware-requests: Detail codfw snapshot/dataset requirements - https://phabricator.wikimedia.org/T118173#1793760 (10ArielGlenn) It's not going to be straight up duplication. There are two things at play here: 1) I want to get rid of nfs when we deploy in codfw. If this seems like it's too h... [16:04:47] apparently I got it right the first time, it's spelt wrong on the calendar [16:05:23] pinged them in -mobile [16:05:48] matt_flaschen, shall we do yours first then? [16:05:49] James_F is in a meeting, e here soon [16:05:56] 6operations, 10Dumps-Generation, 10hardware-requests: Detail codfw snapshot/dataset requirements - https://phabricator.wikimedia.org/T118173#1793765 (10ArielGlenn) [16:06:18] Krenair, sure. [16:06:39] Krenair: Pong. [16:08:46] 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1793775 (10Ottomata) Hi all, I talked to @gwicke a little bit more about this last Thursday. He impressed upon me a couple of good points I hadn't fully taken in before, and I wa... [16:10:06] Just when I thought we had enough '*erbium's, apparently there's a #Mobile-App-Android-Sprint-70-Ytterbium project in phabricator [16:10:36] Krenair, I love their sprint names. [16:10:41] Mobile in general. 
[16:10:44] 6operations, 10Dumps-Generation, 10hardware-requests: Detail codfw snapshot/dataset requirements - https://phabricator.wikimedia.org/T118173#1793778 (10ArielGlenn) So @robh can you have a look at the eqiad hw ticket and let's hash that out first? Then we can use that as the basis for hw in codfw with whatev... [16:10:56] <_joe_> !log removing old builds from copper [16:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:35] Oh and a #Mobile-App-Android-Sprint-68-Erbium, I must've missed that one [16:11:44] and #Mobile-App-Sprint-65-Android-Terbium [16:11:55] Krenair: They're here to get you. [16:11:58] :) [16:12:33] hm, at some point the 'Sprint $x' and 'Android' got swapped [16:14:47] !log krenair@tin Synchronized php-1.27.0-wmf.5/extensions/Flow/modules/mw.flow.Initializer.js: https://gerrit.wikimedia.org/r/#/c/251560/ (duration: 00m 44s) [16:14:49] matt_flaschen, please test ^ [16:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:16:30] 10Ops-Access-Requests, 6operations: Requesting access to add perf-roots group to graphite role - https://phabricator.wikimedia.org/T117256#1793801 (10chasemp) >>! In T117256#1793735, @RobH wrote: > +1 to @chasemp's proposed rename of the group (no objection to request.) I did rename this already fyi [16:17:45] (03CR) 10Filippo Giunchedi: swift: monitor mediawiki originals upload rate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/251526 (https://phabricator.wikimedia.org/T92322) (owner: 10Filippo Giunchedi) [16:17:57] (03PS2) 10Filippo Giunchedi: swift: monitor mediawiki originals upload rate [puppet] - 10https://gerrit.wikimedia.org/r/251526 (https://phabricator.wikimedia.org/T92322) [16:18:24] matt_flaschen, ? 
[16:18:33] 6operations, 10Dumps-Generation, 10hardware-requests: determine hardware needs for dumps in eqiad (boxes out of warranty, capacity planning) - https://phabricator.wikimedia.org/T118154#1793805 (10RobH) [16:18:57] Krenair, I tested, it works but there is a problem. [16:19:08] 6operations, 10Dumps-Generation, 10hardware-requests: determine hardware needs for dumps in eqiad (boxes out of warranty, capacity planning) - https://phabricator.wikimedia.org/T118154#1793806 (10RobH) I've claimed and added this to #hardware-requests. With the details provided by @ArielGlenn, I'll request... [16:19:33] It makes the edit successfully but doesn't render it, and there is a JS error. No need to revert, I don't think. I will fix the remaining issue now. [16:19:37] okay [16:19:55] [config+script] 251677 Enable Flow user opt-in Beta Feature on Wikidata task T116611 [16:20:31] matt_flaschen, James_F: Shall we proceed with this? What needs to be run exactly? [16:21:09] FlowUpdateBetaFeaturePreference.php needs to be run after. I can do that. [16:21:28] Just "mwscript extensions/Flow/maintenance/FlowUpdateBetaFeaturePreference.php wikidatawiki"? [16:21:32] ok, I'll leave that to you [16:21:55] (03CR) 10Alex Monk: [C: 032] Enable Flow user opt-in Beta Feature on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251677 (https://phabricator.wikimedia.org/T116611) (owner: 10Jforrester) [16:22:55] (03Merged) 10jenkins-bot: Enable Flow user opt-in Beta Feature on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251677 (https://phabricator.wikimedia.org/T116611) (owner: 10Jforrester) [16:24:02] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/251677/ (duration: 00m 35s) [16:24:04] matt_flaschen, ^ please test [16:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:25:47] Krenair, works on my page (https://www.wikidata.org/wiki/User_talk:Mattflaschen-WMF). 
I'll run the script now. [16:26:05] 10Ops-Access-Requests, 6operations: Requesting access to add perf-roots group to graphite role - https://phabricator.wikimedia.org/T117256#1793813 (10chasemp) [16:26:57] (03CR) 10Alex Monk: [C: 032] Add an apple-app-site-association file used to support iOS deep-linking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250897 (https://phabricator.wikimedia.org/T111829) (owner: 10Bgerstle) [16:27:02] bgerstle, bd808: hi [16:27:10] Krenair: o/ [16:27:29] this should work for all wikipedia subdomains, right [16:27:30] ? [16:27:37] Krenair: yes [16:27:38] Krenair: correct [16:27:39] (03Merged) 10jenkins-bot: Add an apple-app-site-association file used to support iOS deep-linking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250897 (https://phabricator.wikimedia.org/T111829) (owner: 10Bgerstle) [16:27:44] Even the non-language-based ones? [16:27:46] including "m.$lang.wikipedia..." [16:27:52] Krenair: Flow/Wikidata LGTM. [16:28:06] Krenair: not concerned w/ meta/office if that's what you mean [16:28:15] those aren't on wikipedia.org [16:28:21] yeah sorry, [16:28:23] Krenair: the magic of when it gets used will be embedded in the iOS app itself [16:28:42] that file in particular, yes [16:28:58] Can the app handle nostalgia, for example? [16:29:00] apple will download it to the device when the app is installed, if the app has certain "entitlements" in its package [16:29:14] Krenair: oh right. or simple, i guess? [16:29:17] the arbcom-* subdomains? [16:29:24] yes, simple? [16:29:46] we're focusing on lang-specific ones, i guess. not much thought has been given to handle those [16:29:49] wg-en? test subdomains? [16:29:58] ten? [16:30:02] !log Ran mwscript extensions/Flow/maintenance/FlowUpdateBetaFeaturePreference.php --wiki=wikidatawiki [16:30:03] it should def work on test & betalabs. 
that's the key so we can start prototyping stuff [16:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:30:25] Can it work on those? If not, does it detect this and send the user to their browser? [16:31:26] bgerstle: the app will have to register domains to intercept, correct? I think Krenair is mostly worried that things will happen greedily and break things that work today. [16:31:29] Krenair: it will always open in safari initially [16:31:38] _if_ there's a site assoc. file for that domain [16:31:41] _and_ there's markup on the page [16:31:57] bd808, will it register subdomains or *.wikipedia.org? [16:32:00] _then_ the OS will prompt the user [16:32:15] Krenair: bd808 only the specific domains we register in the app [16:32:37] i was planning on writing a script which generates the entitlements (file where this is declared) based on Special:SiteMatrix [16:32:54] only the wikipedia column, for now at least [16:32:59] I was about to say, if you're registering against subdomains you really need to have a section on the new wiki creation process [16:33:05] But if it's sitematrix based that should be okay [16:33:39] Krenair: bd808 the main goal for now is to get the file up on beta labs so we can start working on the next steps, i.e. web markup and user flows [16:34:01] so it's not a blocker if this doesn't work on prod domains (as it shouldn't be used there anyway) [16:34:02] 6operations, 10OTRS: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1793847 (10Steinsplitter) >>! In T74109#1757895, @akosiaris wrote: > I 've upgraded the test installation today to OTRS version 5.0.1. There is one thing that has not been upgraded to version 5 and that... [16:34:08] ... You're aware that this is putting it in production, right? 
[16:34:21] it shouldn't have an effect [16:34:29] ok [16:35:00] !log krenair@tin Synchronized docroot/wikipedia.org/apple-app-site-association: https://gerrit.wikimedia.org/r/#/c/250897/ (duration: 00m 34s) [16:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:35:15] i can follow-up when we plan to "roll this out" to other domains so you can watch for potential impact of clients downloading this file [16:35:16] bgerstle: https://en.m.wikipedia.org/apple-app-site-association [16:35:37] Krenair: so, is this on betalabs now too? [16:35:40] Also http://en.wikipedia.beta.wmflabs.org/apple-app-site-association [16:36:12] and http://en.m.wikipedia.beta.wmflabs.org/apple-app-site-association [16:36:15] lgtm [16:36:16] 👍🏻 [16:38:15] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:38:41] (03CR) 10Alex Monk: [C: 032] Enable VisualEditor for draft namespace in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251674 (https://phabricator.wikimedia.org/T118060) (owner: 10Ladsgroup) [16:39:22] (03Merged) 10jenkins-bot: Enable VisualEditor for draft namespace in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251674 (https://phabricator.wikimedia.org/T118060) (owner: 10Ladsgroup) [16:39:48] 6operations, 10OTRS: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1793858 (10akosiaris) >>! In T74109#1793847, @Steinsplitter wrote: >>>! In T74109#1757895, @akosiaris wrote: >> I 've upgraded the test installation today to OTRS version 5.0.1. There is one thing that... 
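bgerstle's idea of generating the app's associated-domain entries from the Wikipedia column of Special:SiteMatrix could look roughly like the following. It is a Python sketch under stated assumptions: the language codes are taken as already fetched (the list in the example is illustrative, not real SiteMatrix output), `applinks_entries` is a hypothetical helper name, and the mobile host is written as `<lang>.m.wikipedia.org` to match the URL Krenair tested above. The `applinks:` prefix is the form used in Apple's associated-domains entitlement.

```python
def applinks_entries(lang_codes, include_mobile=True):
    """Build sorted, de-duplicated 'applinks:' entries for Wikipedia language subdomains."""
    entries = set()
    for code in lang_codes:
        code = code.strip().lower()
        if not code:
            # Skip blank codes rather than emitting "applinks:.wikipedia.org".
            continue
        entries.add(f"applinks:{code}.wikipedia.org")
        if include_mobile:
            entries.add(f"applinks:{code}.m.wikipedia.org")
    return sorted(entries)


if __name__ == "__main__":
    # Illustrative input only; a real run would read the codes from the site matrix.
    for entry in applinks_entries(["en", "de", "fa"]):
        print(entry)
```

Because only listed language codes produce entries, special subdomains such as nostalgia or arbcom-* stay out unless someone adds them on purpose, which is the non-greedy behaviour bd808 and Krenair were asking about.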
[16:40:04] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING [16:40:25] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/251674/ (duration: 00m 35s) [16:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:40:32] is one server slow? [16:40:49] James_F, ^ [16:40:56] Ta. [16:41:25] Isn't it the whole sync of git stuff to codfw that's slow [16:41:26] ? [16:42:13] yes [16:42:14] Krenair: Yup, working. [16:42:58] 6operations, 10OTRS: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1793861 (10akosiaris) >>! In T74109#1793858, @akosiaris wrote: >>>! In T74109#1793847, @Steinsplitter wrote: >>>>! In T74109#1757895, @akosiaris wrote: >>> I 've upgraded the test installation today to... [16:43:33] (03PS1) 10Faidon Liambotis: Add cr2-esams to monitoring tools [puppet] - 10https://gerrit.wikimedia.org/r/251983 [16:43:35] (03PS1) 10Faidon Liambotis: network: monitor mr1 OOB links too [puppet] - 10https://gerrit.wikimedia.org/r/251984 [16:43:37] akosiaris: your fast as well :) cool. thans a lot [16:44:25] !log krenair@tin Synchronized php-1.27.0-wmf.5/extensions/VisualEditor/modules/ve-mw/init: https://gerrit.wikimedia.org/r/#/c/251972/ (duration: 00m 34s) [16:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:44:29] James_F, ^ [16:47:07] Krenair: Looks to be working fine. [16:47:34] (03PS3) 10Alex Monk: Add patroller group to sawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250454 (https://phabricator.wikimedia.org/T117314) (owner: 10Luke081515) [16:47:36] 6operations, 10Gitblit: Accessing raw link on git.wikimedia.org causes "Error Sorry, the repository mediawiki does not have a extensions branch!" 
- https://phabricator.wikimedia.org/T118156#1793885 (10Paladox) Please read https://github.com/gitblit/gitblit/issues/949#issuecomment-155110940 According to the au... [16:47:40] (03CR) 10Alexandros Kosiaris: [C: 031] network: monitor mr1 OOB links too [puppet] - 10https://gerrit.wikimedia.org/r/251984 (owner: 10Faidon Liambotis) [16:48:15] (03CR) 10Alexandros Kosiaris: [C: 031] Add cr2-esams to monitoring tools [puppet] - 10https://gerrit.wikimedia.org/r/251983 (owner: 10Faidon Liambotis) [16:48:17] (03CR) 10Alex Monk: [C: 032] Add patroller group to sawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250454 (https://phabricator.wikimedia.org/T117314) (owner: 10Luke081515) [16:48:28] (03CR) 10Faidon Liambotis: [C: 032] Add cr2-esams to monitoring tools [puppet] - 10https://gerrit.wikimedia.org/r/251983 (owner: 10Faidon Liambotis) [16:48:36] (03CR) 10Faidon Liambotis: [C: 032] network: monitor mr1 OOB links too [puppet] - 10https://gerrit.wikimedia.org/r/251984 (owner: 10Faidon Liambotis) [16:48:44] (03Merged) 10jenkins-bot: Add patroller group to sawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250454 (https://phabricator.wikimedia.org/T117314) (owner: 10Luke081515) [16:49:03] Krenair: … hmm. [16:49:16] James_F, ? [16:49:51] Krenair: No, never mind, think I pressed the wrong button. [16:49:58] :) [16:50:51] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/250454/ (duration: 00m 34s) [16:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:52:54] Huh: https://wg-en.m.wikipedia.org/apple-app-site-association [16:53:05] apparently that domains isn't in DNS? [16:53:14] But the non-m version is. Helpful. [16:53:36] Oh well, it's an old locked wiki. [16:53:40] I wonder if there are any more though. 
[16:54:12] _joe_: wasn't the tools-proxies, since toolserver.org is different (just hosts redirects) in a different module [16:54:39] YuviPanda: we tracked it down thanks, a few question on possibly collapsing things but for later :) [16:54:47] Krinkle, you around? [16:54:47] /win go 36 [16:54:58] chasemp: kk [16:55:04] YuviPanda: it seems to have just been overzealous requests tanking a one core vm that is still doing a bit of user redirect traffic [16:55:16] yeah [16:55:21] it's also apache for no reason instead of just nginx [16:57:01] (03CR) 10Sbgujarat: "I have seen this and this is good." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250454 (https://phabricator.wikimedia.org/T117314) (owner: 10Luke081515) [17:03:16] (03CR) 10NehalDaveND: "This is the best." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250454 (https://phabricator.wikimedia.org/T117314) (owner: 10Luke081515) [17:07:14] 6operations, 6Discovery, 5codfw-rollout: [EPIC] Set up a CirrusSearch cluster in codfw (Dallas, Texas) - https://phabricator.wikimedia.org/T105703#1793970 (10chasemp) ping'd on irc but, is this epic task now done? anything remaining? [17:23:32] (03PS1) 10Luke081515: Set throttle exception for account creation on hewiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252010 (https://phabricator.wikimedia.org/T118122) [17:26:20] (03CR) 10RobH: "this patchset allows them to delete keys as well, did we want this to allow reinstall as well as initial install? 
(Less oversight.)" [puppet] - 10https://gerrit.wikimedia.org/r/249483 (https://phabricator.wikimedia.org/T116884) (owner: 10Dzahn) [17:26:25] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: puppet fail [17:34:55] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 39, down: 14, dormant: 0, excluded: 0, unused: 0BRet-0/2/1: down - BRxe-0/1/3: down - BRxe-0/1/0: down - BRxe-0/1/7: down - BRxe-0/1/8: down - BRxe-0/1/9: down - BRxe-0/1/6: down - BRxe-0/1/11: down - BRxe-0/1/4: down - BRxe-0/1/10: down - BRxe-0/1/1: down - BRxe-0/1/2: down - BRxe-0/1/5: down - BRet-0/2/2: down - BR [17:35:25] (03CR) 10Alex Monk: Allow import from any Labs/Beta Cluster project to any other (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO) [17:36:15] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0] [17:38:15] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:38:31] (03PS10) 10Alex Monk: Allow import from any Labs/Beta Cluster project to any other [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO) [17:38:35] (03PS1) 10Luke081515: Add new group to enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252012 (https://phabricator.wikimedia.org/T113109) [17:40:34] (03CR) 10Luke081515: "Please look at the community consensus again, to make sure, that enough users agreed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252012 (https://phabricator.wikimedia.org/T113109) (owner: 10Luke081515) [17:42:01] (03CR) 10CSteipp: "> How does this work with CentralAuth...? Can't the attacker just login on another wiki and use autologin to get access to enwiki?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/251678 (owner: 10CSteipp) [17:45:55] 6operations, 6Discovery, 5codfw-rollout: [EPIC] Set up a CirrusSearch cluster in codfw (Dallas, Texas) - https://phabricator.wikimedia.org/T105703#1794100 (10Deskana) 5Open>3Resolved a:3Deskana >>! In T105703#1793970, @chasemp wrote: > ping'd on irc but, is this epic task now done? anything remaining?... [17:48:29] 6operations, 10ops-codfw: es2010 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T117848#1794111 (10jcrespo) @mark^ [17:50:58] 6operations, 10Beta-Cluster-Infrastructure, 7Blocked-on-RelEng, 7HHVM, 5Patch-For-Review: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1794121 (10mmodell) Is this really blocked on #blocked-on-releng? [17:51:34] (03CR) 10Ori.livneh: [C: 031] Allow import from any Labs/Beta Cluster project to any other [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO) [17:53:26] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:55:08] !log upgrading pybal (1.10 -> 1.12) on lvs200[456].codfw.wmnet [17:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:57:24] (03PS4) 10Faidon Liambotis: Switch Central/South Asia to esams [dns] - 10https://gerrit.wikimedia.org/r/239072 [17:58:31] (03PS5) 10Alexandros Kosiaris: rubocop: Ignoring Style/WordArray offense [puppet] - 10https://gerrit.wikimedia.org/r/238778 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [17:58:45] (03CR) 10Alexandros Kosiaris: [C: 031] rubocop: Ignoring Style/WordArray offense [puppet] - 10https://gerrit.wikimedia.org/r/238778 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [18:00:03] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] rubocop: Ignoring Style/WordArray offense [puppet] - 
10https://gerrit.wikimedia.org/r/238778 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [18:03:14] mobrovac: h-o ? [18:03:21] nevermind [18:04:08] PROBLEM - Host labtestservices2001 is DOWN: PING CRITICAL - Packet loss = 100% [18:18:06] (03PS11) 10Krinkle: Enable import from any Beta Cluster project to another [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO) [18:19:47] (03CR) 10Luke081515: [C: 031] Enable import from any Beta Cluster project to another [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO) [18:23:59] (03CR) 10Krinkle: Enable import from any Beta Cluster project to another (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO) [18:41:12] (03CR) 10Dzahn: servermon: add ferm rules for http/https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/251552 (https://phabricator.wikimedia.org/T105410) (owner: 10Dzahn) [18:41:14] (03PS2) 10Dzahn: servermon: add ferm rules for http/https [puppet] - 10https://gerrit.wikimedia.org/r/251552 (https://phabricator.wikimedia.org/T105410) [18:41:59] (03PS3) 10Dzahn: librenms: add ferm rules for http/https [puppet] - 10https://gerrit.wikimedia.org/r/251550 (https://phabricator.wikimedia.org/T105410) [18:42:13] (03CR) 10Dzahn: [C: 032] librenms: add ferm rules for http/https [puppet] - 10https://gerrit.wikimedia.org/r/251550 (https://phabricator.wikimedia.org/T105410) (owner: 10Dzahn) [18:42:43] !log restarting mysql in db1060 to test new performance configuration [18:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:43:23] (03CR) 10Dzahn: "as opposed to servermon, this is also on 443." 
[puppet] - 10https://gerrit.wikimedia.org/r/251550 (https://phabricator.wikimedia.org/T105410) (owner: 10Dzahn) [18:44:04] (03Abandoned) 10Dzahn: dns::recursor: move 'standard' and v6 IP to role [puppet] - 10https://gerrit.wikimedia.org/r/250616 (owner: 10Dzahn) [18:44:11] in context of the train deployments, Commons is in group0 (All non-Wikipedia sites), not group1 (All Wikipedias), correct? [18:44:30] (i know that some tools consider it a Wikipedia with language "commons", so asking to be sure) [18:49:05] I don't think group0 is all non-wikipedia sites, is it? [18:49:55] group 0: MediaWiki.org; test.wikipedia.org; test2.wikipedia.org; test.wikidata.org; zero.wikimedia.org [18:50:04] group 1: All non-Wikipedia sites [18:50:11] group 2: All Wikipedias [18:50:54] sorry, s/1/2/, s/0/1/ [18:51:27] so, what is the answer to the corrected question? :) [18:51:34] Some tools consider commons to be a wikipedia because it ends with that 'wiki' prefix [18:51:45] commons is in group1 [18:51:49] ie: wednesday [18:52:14] https://wikitech.wikimedia.org/wiki/Deployments/One_week [18:52:18] answers such questions :) [18:52:22] sorry, suffix. not prefix. [18:52:26] historical reasons probably [18:52:35] at some point there was just wikipedia [18:52:42] and now we can't rename databases [18:52:48] so we're stuck with it [18:53:31] honestly one of the things I hate most about our particular configuration of mediawiki [18:56:06] and I just made a query 300x faster [18:56:21] commonswiki. wikidatawiki. clearly wikipedias [18:56:40] (right. more food, less snark) [18:57:42] my favourite example is sourceswiki [18:58:15] guess the domain, and guess what site a bunch of our software thinks it is [18:58:33] (03PS2) 10Muehlenhoff: Assign salt grains for cp* hosts [puppet] - 10https://gerrit.wikimedia.org/r/251949 [18:58:33] jynus: awesome way to start the week! 
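Note on the naming thread above (18:51 to 18:58): a tool that decides "is this a Wikipedia?" from the database-name suffix alone will misfile commonswiki and wikidatawiki, which is why Commons can look like a group 2 site when it rides the train in group 1. A minimal sketch of the failure and of an override-table workaround follows. The table and the function names are hypothetical, made up for illustration, and not taken from any Wikimedia tool.

```python
# Hypothetical sketch: the override table and function names are assumptions
# made for illustration, not Wikimedia's real configuration.
NON_WIKIPEDIA_DBNAMES = {
    "commonswiki": "commons",
    "wikidatawiki": "wikidata",
}


def naive_family(dbname):
    """Suffix-only guess: every database name ending in 'wiki' is a Wikipedia."""
    return "wikipedia" if dbname.endswith("wiki") else "other"


def family(dbname):
    """Consult the explicit override table before trusting the suffix."""
    return NON_WIKIPEDIA_DBNAMES.get(dbname, naive_family(dbname))


def train_group(dbname):
    """Group 2 is all Wikipedias, group 1 is all non-Wikipedia sites.

    Group 0 (the test sites listed in the channel) is ignored in this sketch.
    """
    return 2 if family(dbname) == "wikipedia" else 1


if __name__ == "__main__":
    for name in ("enwiki", "commonswiki", "wikidatawiki"):
        print(name, naive_family(name), family(name), train_group(name))
```

The override table has to be maintained by hand, which is the cost of not being able to rename the databases.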
[18:59:15] well, 'favourite' [18:59:57] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: puppet fail [19:01:20] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for cp* hosts [puppet] - 10https://gerrit.wikimedia.org/r/251949 (owner: 10Muehlenhoff) [19:01:45] (03PS1) 10BryanDavis: Monolog: Disable microsecond timestamps on all loggers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252017 (https://phabricator.wikimedia.org/T116550) [19:06:53] (03PS1) 10Muehlenhoff: Use role spare for calcium [puppet] - 10https://gerrit.wikimedia.org/r/252018 [19:07:49] (03CR) 10Dzahn: [C: 032] Use role spare for calcium [puppet] - 10https://gerrit.wikimedia.org/r/252018 (owner: 10Muehlenhoff) [19:09:07] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds [19:12:02] 6operations: reclaim calcium to spares - https://phabricator.wikimedia.org/T116790#1794376 (10Dzahn) also see T105553 and T83044 [19:12:29] (03PS2) 10BryanDavis: Monolog: Disable microsecond timestamps on all loggers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252017 (https://phabricator.wikimedia.org/T116550) [19:17:45] (03PS1) 10Muehlenhoff: Assign salt grains for analytics::mysql::meta role [puppet] - 10https://gerrit.wikimedia.org/r/252020 [19:18:55] (03PS2) 10RobH: admin: hoo and jzerebecki for wdqs admins [puppet] - 10https://gerrit.wikimedia.org/r/249027 (https://phabricator.wikimedia.org/T116702) (owner: 10Dzahn) [19:19:49] (03CR) 10RobH: [C: 032] admin: hoo and jzerebecki for wdqs admins [puppet] - 10https://gerrit.wikimedia.org/r/249027 (https://phabricator.wikimedia.org/T116702) (owner: 10Dzahn) [19:21:18] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: wdqs-admin group membership for Marius Hoch (hoo) and Jan Zerebecki - https://phabricator.wikimedia.org/T116702#1794389 (10RobH) 5Open>3Resolved a:3RobH This was approved in the operations meeting, and has been merged live. 
Once the affected hosts... [19:21:37] (03CR) 10Dzahn: "the sshd config part comes from this in hiera:" [puppet] - 10https://gerrit.wikimedia.org/r/250659 (owner: 10Muehlenhoff) [19:22:06] (03PS2) 10Muehlenhoff: Assign salt grains for analytics::mysql::meta role [puppet] - 10https://gerrit.wikimedia.org/r/252020 [19:22:13] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for analytics::mysql::meta role [puppet] - 10https://gerrit.wikimedia.org/r/252020 (owner: 10Muehlenhoff) [19:22:33] (03CR) 10Dzahn: ".. and only now that would be actually applied because the role keyword was not first before. but maybe we can meanwhile remove these exce" [puppet] - 10https://gerrit.wikimedia.org/r/250659 (owner: 10Muehlenhoff) [19:23:31] andrewbogott: is this (still) true? "paramiko needs to to ssh into silver to support designate" [19:23:50] yes [19:23:57] (afaict) [19:24:19] oh lol, maybe it didn't work so far? [19:24:22] hmm [19:24:25] YuviPanda: ok, thanks. the comment said "ssh into these" but it is in role/common/nova/controller.yaml [19:24:32] yes, it didnt work so far [19:24:52] and now we can either make it work or remove the exception for it as opposed to regular sshd [19:24:56] right [19:25:01] I guess andrewbogott will know more [19:25:05] *nod* ok [19:25:08] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [19:27:09] (03PS1) 10Bmansurov: Enable RelatedArticles on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252022 [19:30:48] (03PS2) 10Bmansurov: Enable RelatedArticles on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252022 (https://phabricator.wikimedia.org/T116707) [19:34:40] 10Ops-Access-Requests, 6operations: Requesting access to rest base and cassandra nodes - https://phabricator.wikimedia.org/T117473#1794429 (10RobH) a:3Nuria @Nuria: Do you know what group grants the permissions required? 
I don't see any aqs named groups, other than the aqs-admins, which is allowing restarti... [19:35:25] (03PS3) 10Dzahn: admin: let dc-ops sign puppet certs, add salt keys [puppet] - 10https://gerrit.wikimedia.org/r/249483 (https://phabricator.wikimedia.org/T116884) [19:36:26] (03PS4) 10Dzahn: admin: let dc-ops sign puppet certs, add salt keys [puppet] - 10https://gerrit.wikimedia.org/r/249483 (https://phabricator.wikimedia.org/T116884) [19:36:42] (03PS1) 10Jcrespo: Pool db1060 after performance patch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252023 [19:38:09] (03PS2) 10Jcrespo: Pool db1060 after performance patch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252023 [19:38:11] (03CR) 10Dzahn: [C: 032] "approved in ops meeting" [puppet] - 10https://gerrit.wikimedia.org/r/249483 (https://phabricator.wikimedia.org/T116884) (owner: 10Dzahn) [19:39:11] (03CR) 10Jcrespo: [C: 032] Pool db1060 after performance patch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252023 (owner: 10Jcrespo) [19:40:50] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool db1060 after performance patch (duration: 00m 35s) [19:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:41:01] 6operations, 7Monitoring: Migrate monitoring alerts from watchmouse to catchpoint - https://phabricator.wikimedia.org/T107092#1794447 (10RobH) 5stalled>3declined We have alerting via email for catchpoint, and these monitor different things (plus nimsoft wasn't reliable.) As such, I'm going to decline this... [19:41:18] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: let datacenter-ops sign puppet certs and accept salt keys - https://phabricator.wikimedia.org/T116884#1794449 (10Dzahn) was approved in ops meeting. merged. on palladium: +%datacenter-ops ALL = NOPASSWD: /usr/bin/salt-key * +%datacenter-ops ALL = NOPAS... 
[19:41:25] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: let datacenter-ops sign puppet certs and accept salt keys - https://phabricator.wikimedia.org/T116884#1794452 (10Dzahn) a:3Dzahn [19:41:37] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1794455 (10RobH) 5Open>3Resolved I resolved the last RT procurement tickets today, and now we use phabricator for all procurement. [19:43:15] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: let datacenter-ops sign puppet certs and accept salt keys - https://phabricator.wikimedia.org/T116884#1760888 (10Dzahn) @papaul see above, when you are back from your vacation and get to install a server again, you can now do the puppet and salt part. l... [19:43:21] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: let datacenter-ops sign puppet certs and accept salt keys - https://phabricator.wikimedia.org/T116884#1794466 (10Dzahn) 5Open>3Resolved [19:44:26] RECOVERY - Host labtestservices2001 is UP: PING OK - Packet loss = 0%, RTA = 35.62 ms [19:46:35] 6operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#1794476 (10ori) [19:46:50] robh, hey, so what's left in RT? 
[19:46:56] 6operations, 10ops-ulsfo: populate spares data for ulsfo - https://phabricator.wikimedia.org/T118207#1794478 (10RobH) 3NEW a:3RobH [19:48:08] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: Connection refused by host [19:48:08] PROBLEM - Disk space on labtestservices2001 is CRITICAL: Connection refused by host [19:48:26] PROBLEM - salt-minion processes on labtestservices2001 is CRITICAL: Connection refused by host [19:48:48] PROBLEM - RAID on labtestservices2001 is CRITICAL: Connection refused by host [19:49:17] PROBLEM - configured eth on labtestservices2001 is CRITICAL: Connection refused by host [19:49:37] PROBLEM - dhclient process on labtestservices2001 is CRITICAL: Connection refused by host [19:49:37] PROBLEM - DPKG on labtestservices2001 is CRITICAL: Connection refused by host [19:49:41] Krenair: maint-announce [19:49:49] https://phabricator.wikimedia.org/T118176 [19:50:00] is that for the datacenters to send announcements to? [19:50:07] Once that is done we'll kill the RT mail relays and shove it into a ganeti VM [19:50:14] and carriers [19:50:23] datacenter and carrier vendors. [19:50:37] you're going to move RT to a VM? [19:50:45] what will it actually have other than historical records? [19:50:48] labtestservices2001 is me I guess, it shoudn't be alerting [19:50:49] we cannot kill it entirely as not every ticket was migrated [19:50:57] and we may need it to reference past purchase approvals [19:51:03] (as we never migrated the old procurement queue) [19:51:16] so you're going to keep it alive indefinitely? :/ [19:51:25] not my idea. [19:51:26] 6operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#1794505 (10Peter) FYI: I've been filing bugs for SPDY/Chrome and WebPageTest and Pat Meenan (of the Chrome team) reminded me that Chrome will drop support for SPDY early 2016. He also said the team will reach... 
[19:51:29] heh, ok [19:52:18] so ideally someone imports the data and marks in resolved or declined(for rejected tickets) [19:52:26] or we'll strip it to html like bz [19:52:44] its unclear if one of those will happen, or if its a vm with full RT stack minus mail [19:55:03] (03PS6) 10Alexandros Kosiaris: rubocop: Ignoring Style/WordArray offense [puppet] - 10https://gerrit.wikimedia.org/r/238778 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [19:55:10] (03CR) 10Alexandros Kosiaris: [V: 032] rubocop: Ignoring Style/WordArray offense [puppet] - 10https://gerrit.wikimedia.org/r/238778 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [19:55:44] (03PS3) 10Alexandros Kosiaris: Update servermon configuration for 0.7 [puppet] - 10https://gerrit.wikimedia.org/r/223347 [19:56:59] YuviPanda, mutante, sorry will catch up after meeting [19:58:25] (03PS5) 10BryanDavis: Prepare to enable QuickSurveys in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson) [19:59:41] 6operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#1794518 (10Krinkle) p:5Normal>3High [20:01:10] (03CR) 10Alexandros Kosiaris: [C: 032] Update servermon configuration for 0.7 [puppet] - 10https://gerrit.wikimedia.org/r/223347 (owner: 10Alexandros Kosiaris) [20:01:12] (03PS1) 10Rush: labtest*.codfw.wmnet definitions [dns] - 10https://gerrit.wikimedia.org/r/252025 (https://phabricator.wikimedia.org/T117097) [20:01:28] PROBLEM - Host labtestservices2001 is DOWN: PING CRITICAL - Packet loss = 100% [20:04:12] (03CR) 10Andrew Bogott: [C: 031] "I keep reading 'metal' as 'meta'" [dns] - 10https://gerrit.wikimedia.org/r/252025 (https://phabricator.wikimedia.org/T117097) (owner: 10Rush) [20:04:26] (03CR) 10Jhobs: [C: 031] "Config change to enable on testwiki (and enable first survey) to come after this rides the train on Tuesday." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson) [20:05:33] (03CR) 10BryanDavis: [C: 031] "Changed this patch to only setup the extension for l10n and keep it disabled everywhere. Migrates things from being setup only for beta cl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson) [20:51:28] (03PS1) 10Muehlenhoff: Assign salt grains for logstash elastic nodes [puppet] - 10https://gerrit.wikimedia.org/r/252106 [20:52:30] (03PS2) 10Muehlenhoff: Assign salt grains for logstash elastic nodes [puppet] - 10https://gerrit.wikimedia.org/r/252106 [20:52:52] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for logstash elastic nodes [puppet] - 10https://gerrit.wikimedia.org/r/252106 (owner: 10Muehlenhoff) [20:56:47] 6operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#1794563 (10ori) @Peter, could you perhaps connect Pat Meenan and @BBlack, so we can ask the Chrome team to not drop SPDY before the Nginx situation is resolved? (And perhaps to use their clout to ask Nginx to... [21:00:04] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151109T2100). Please do the needful. [21:00:39] No deploy today. [21:00:41] lol. [21:00:52] no deploys [21:01:17] what's the deal with eventlog2001, it's no longer in site.pp, but e.g. shows up in servermon? 
[21:01:17] heh [21:11:22] (03PS1) 10Muehlenhoff: Assign salt grains for labs::db::slave and labs::db::master roles [puppet] - 10https://gerrit.wikimedia.org/r/252112 [21:12:34] 6operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#1794583 (10ori) @BBlack: https://grafana.wikimedia.org/dashboard/db/client-connections?panelId=5&fullscreen&from=1446498648837&to=1447103448837 shows support for SPDY ranging from ~62% to ~71%, which is at od... [21:12:41] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for labs::db::slave and labs::db::master roles [puppet] - 10https://gerrit.wikimedia.org/r/252112 (owner: 10Muehlenhoff) [21:15:32] 6operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#1794586 (10BBlack) @ori: my stats are from clienthellos, so they're per-**connection**. All of our stats based on X-Connection-Properties are (in some cases, unfortunately) per-**request**, even if several o... [21:16:20] ottomata: statsv does not appear to work at the moment, but the service is up. are you aware of any issues w/kafka? [21:17:17] hm, ori, moritzm and I enabled ferm on kafka1014 today, and in doing so we did a broker restart [21:17:40] ori, ottomata: let me have a look at the logs [21:17:42] ori, do you remember if you are using kafka-python or python-kafka [21:17:43] ? [21:17:48] sorry [21:17:48] or [21:17:49] pykafka* [21:18:10] we are using pykafka for consumption in eventlogging now, and I had to restart some of the consumers when we did this [21:18:12] and i'm not sure why [21:18:14] python-pykafka i think [21:18:30] ottomata: I doubt it's related to the ferm rules, nothing got dropped traffic-wise throughout the day [21:18:35] should i try restarting it, or would you like me to keep it running in its current state? [21:18:49] lemme look real quick, then we can restart [21:18:52] where is it running? [21:18:58] hafnium [21:19:19] oh jessie now..right! 
:) [21:20:35] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1794593 (10RobH) a:5mark>3RobH [21:21:14] ori, am looking at syslog there, and I see Nov 9 14:21:07 hafnium python[16995]: CalledProcessError: Command '['phantomjs', '--ssl-protocol=any', 'asset-check.js', 'https://zh.wikipedia.org/?mainpage']' returned non-zero exit status 1 [21:21:35] that's not statsv [21:21:55] hm [21:21:57] k [21:22:20] hard to separate them out in syslog... [21:22:27] let me strace it and see what it's actually doing [21:22:46] service statsv status gives a little info [21:22:47] maybe [21:25:24] waiting on kafka [21:25:25] it looks like [21:25:29] reading but nothing is coming [21:25:44] yeah, we saw this once with pykafka too.. [21:25:49] there is an issue i think [21:25:51] go ahead and restart [21:26:22] https://github.com/Parsely/pykafka/issues/189 [21:29:28] (03CR) 10Andrew Bogott: [C: 032] labtest*.codfw.wmnet definitions [dns] - 10https://gerrit.wikimedia.org/r/252025 (https://phabricator.wikimedia.org/T117097) (owner: 10Rush) [21:30:10] (03PS5) 10RobH: Add perf-roots to Graphite role [puppet] - 10https://gerrit.wikimedia.org/r/249966 (owner: 10Ori.livneh) [21:30:17] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [21:31:24] (03CR) 10RobH: [C: 032] "approved in ops meeting." [puppet] - 10https://gerrit.wikimedia.org/r/249966 (owner: 10Ori.livneh) [21:31:48] 10Ops-Access-Requests, 6operations: Requesting access to add perf-roots group to graphite role - https://phabricator.wikimedia.org/T117256#1794612 (10RobH) 5Open>3Resolved a:3RobH This was approved in our operations meeting and is now merged live. It may take up to 30 minutes for affected hosts to call... 
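Note on the statsv trace above (21:16 to 21:26): the process was up and blocked on a read from Kafka with nothing arriving, and a restart cleared it. One way a consumer could notice that state itself is a last-message timer. This is a minimal sketch under stated assumptions: `StallWatchdog`, the 60-second timeout and the injected clock are hypothetical, and nothing here is part of statsv, pykafka or kafka-python.

```python
import time


class StallWatchdog:
    """Remembers when the last message arrived and flags a stall after a timeout.

    Hypothetical helper: the consumer loop is assumed to call record_message()
    for every message it handles.
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = clock()

    def record_message(self):
        """Call once per consumed message."""
        self.last_seen = self.clock()

    def is_stalled(self):
        """True when no message has been recorded for longer than timeout_s."""
        return self.clock() - self.last_seen > self.timeout_s


if __name__ == "__main__":
    fake_now = [0.0]
    watchdog = StallWatchdog(60, clock=lambda: fake_now[0])
    fake_now[0] = 100.0
    print("stalled:", watchdog.is_stalled())
```

A quiet topic looks the same as a hung consumer to this check, so the timeout only makes sense for streams that are expected to carry steady traffic.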
[21:37:53] (03PS1) 10Ori.livneh: Add perf-roots to webperf role (as part of I583d9a571) [puppet] - 10https://gerrit.wikimedia.org/r/252114 [21:39:12] (03CR) 10Alexandros Kosiaris: "@Subbu, yes Nov 4" [puppet] - 10https://gerrit.wikimedia.org/r/249399 (owner: 10Subramanya Sastry) [21:39:36] akosiaris, thanks [21:40:13] subbu: you 're welcome [21:49:17] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [21:50:12] hmm, ori, i think you are using kafka-python? [21:50:17] (03PS1) 10Muehlenhoff: Assign salt grains for db::redis role [puppet] - 10https://gerrit.wikimedia.org/r/252116 [21:50:24] yes, which commits offsets to zk instead of kafka, (old way) [21:51:25] ori, fyi, we recently installed burrow on krypton, and if you us pykafka and commit offsets to kafka, you can get monitoring about consumer lag and status [21:51:26] 6operations: install/setup/deploy neodymium as salt-master in eqiad - https://phabricator.wikimedia.org/T118210#1794623 (10RobH) 3NEW a:3RobH [21:51:38] e.g. 
curl http://krypton.eqiad.wmnet:8000/v2/kafka/eqiad/consumer [21:51:39] 6operations: install/setup/deploy neodymium as salt-master in eqiad - https://phabricator.wikimedia.org/T118210#1794633 (10RobH) [21:51:40] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1720246 (10RobH) [21:51:52] and curl http://krypton.eqiad.wmnet:8000/v2/kafka/eqiad/consumer/eventlogging-00/topic/eventlogging-client-side [21:52:12] !log MobileApps deployed sha1 6c63984 [21:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:52:26] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1794641 (10RobH) [21:52:27] 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1794640 (10RobH) [21:52:28] 6operations: install/setup/deploy neodymium as salt-master in eqiad - https://phabricator.wikimedia.org/T118210#1794623 (10RobH) [21:52:30] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1794635 (10RobH) 5Open>3Resolved Well, neodymium has SSDs, but the OS will be placed on the larger SAS disks. All of the possible allocations are slightly out of spec, this one being we... [21:54:25] ottomata: so what do i need to do (to make statsv compliant) [21:54:59] ori, just used pykafka instead of kafka-python (pykafka is better) (although I don't know if it would have avoided the problem you just poked me about) [21:55:27] http://pykafka.readthedocs.org/en/latest/ [21:55:54] esp. 
now that hafnium is on jessie, you have a debian python-pykafka package you can use [21:56:17] ori, this is balanced consumer, which you might not need: [21:56:17] https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/server/eventlogging/handlers.py#L471 [21:56:30] (03PS2) 10Muehlenhoff: Assign salt grains for db::redis role [puppet] - 10https://gerrit.wikimedia.org/r/252116 [21:56:43] (03PS1) 10RobH: setting neodymium production dns entries [dns] - 10https://gerrit.wikimedia.org/r/252117 [21:57:24] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for db::redis role [puppet] - 10https://gerrit.wikimedia.org/r/252116 (owner: 10Muehlenhoff) [21:58:47] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [22:15:13] (03PS1) 10RobH: setting install params for neodymium [puppet] - 10https://gerrit.wikimedia.org/r/252119 [22:16:25] (03CR) 10RobH: [C: 032] setting neodymium production dns entries [dns] - 10https://gerrit.wikimedia.org/r/252117 (owner: 10RobH) [22:16:50] (03PS2) 10RobH: setting install params for neodymium [puppet] - 10https://gerrit.wikimedia.org/r/252119 [22:17:55] (03CR) 10RobH: [C: 032] setting install params for neodymium [puppet] - 10https://gerrit.wikimedia.org/r/252119 (owner: 10RobH) [22:23:10] woo hoo! [22:23:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [22:25:07] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0] [22:27:18] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 2 below the confidence bounds [22:30:57] (03CR) 10Dzahn: "i don't really have an opinion here. 
can you add a commit message that explains a bit what is being changed here" [puppet] - 10https://gerrit.wikimedia.org/r/251836 (owner: 10Paladox) [22:31:07] RECOVERY - Host labtestservices2001 is UP: PING OK - Packet loss = 0%, RTA = 34.85 ms [22:32:20] (03CR) 10Dzahn: [C: 031] "i guess.. for consistency" [puppet] - 10https://gerrit.wikimedia.org/r/251714 (https://phabricator.wikimedia.org/T115067) (owner: 10JanZerebecki) [22:32:36] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Puppet has 1 failures [22:33:27] PROBLEM - RAID on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:33:36] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:34:17] PROBLEM - configured eth on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:35:34] 6operations: install/setup/deploy neodymium as salt-master in eqiad - https://phabricator.wikimedia.org/T118210#1794706 (10RobH) [22:35:57] RECOVERY - configured eth on analytics1032 is OK: OK - interfaces up [22:36:57] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 3 below the confidence bounds [22:37:07] RECOVERY - RAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical [22:37:17] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING [22:37:38] 6operations, 6Discovery, 5codfw-rollout: [EPIC] Set up a CirrusSearch cluster in codfw (Dallas, Texas) - https://phabricator.wikimedia.org/T105703#1794708 (10EBernhardson) [22:38:02] 6operations, 6Discovery, 5codfw-rollout: [EPIC] Set up a CirrusSearch cluster in codfw (Dallas, Texas) - https://phabricator.wikimedia.org/T105703#1449703 (10EBernhardson) This is now up and running with a full copy of the index and all writes going to it. We should do a load test and ensure this meets our e... 
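Note on the Burrow URLs pasted at 21:51: both follow one path pattern, so they can be built from the cluster, consumer group and topic names. A minimal sketch, assuming only what the two curl commands show. The host, port and `/v2/kafka/...` paths come from the channel; the function names are hypothetical, and the JSON layout of the responses is not shown in the log, so `fetch_json` returns whatever the server sends without interpreting it.

```python
import json
from urllib.request import urlopen

# Host and port exactly as pasted in the channel.
BURROW = "http://krypton.eqiad.wmnet:8000"


def consumers_url(cluster, base=BURROW):
    """URL that lists consumer groups for a cluster (first curl in the log)."""
    return f"{base}/v2/kafka/{cluster}/consumer"


def consumer_topic_url(cluster, group, topic, base=BURROW):
    """URL for one consumer group on one topic (second curl in the log)."""
    return f"{base}/v2/kafka/{cluster}/consumer/{group}/topic/{topic}"


def fetch_json(url, timeout=5):
    """Fetch a URL and decode the body as JSON.

    The response structure is an unknown here, so it is returned as is.
    """
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(consumers_url("eqiad"))
    print(consumer_topic_url("eqiad", "eventlogging-00", "eventlogging-client-side"))
```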
[22:38:27] (03Abandoned) 10Dzahn: logstash::elasticsearch add base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/248918 (https://phabricator.wikimedia.org/T104964) (owner: 10Dzahn) [22:38:28] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0] [22:39:14] (03CR) 10Dzahn: "i don't know the current status of this anymore. wish i would" [dns] - 10https://gerrit.wikimedia.org/r/248504 (https://phabricator.wikimedia.org/T599) (owner: 10Dzahn) [22:41:24] 6operations: reclaim rubidium to spares - https://phabricator.wikimedia.org/T118213#1794720 (10RobH) 3NEW a:3RobH [22:42:23] 6operations: reclaim rubidium to spares - https://phabricator.wikimedia.org/T118213#1794732 (10RobH) [22:42:36] (03PS3) 10Dzahn: servermon: add ferm rules for http/https [puppet] - 10https://gerrit.wikimedia.org/r/251552 (https://phabricator.wikimedia.org/T105410) [22:42:48] (03PS1) 10EBernhardson: Enable CirrusSearch writes to labsearch for all but enwiki and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252120 [22:42:50] (03PS1) 10EBernhardson: Enable CirrusSearch writes to enwiki and dewiki as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252121 [22:42:54] (03CR) 10Dzahn: [C: 032] "no-op until base::firewall" [puppet] - 10https://gerrit.wikimedia.org/r/251552 (https://phabricator.wikimedia.org/T105410) (owner: 10Dzahn) [22:43:15] (03PS2) 10Dereckson: Set throttle exception for University of Haifa wiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252010 (https://phabricator.wikimedia.org/T118122) (owner: 10Luke081515) [22:43:22] (03CR) 10Dereckson: [C: 031] Set throttle exception for University of Haifa wiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252010 (https://phabricator.wikimedia.org/T118122) (owner: 10Luke081515) [22:43:27] 6operations: install/setup/deploy neodymium as salt-master in eqiad - https://phabricator.wikimedia.org/T118210#1794735 (10RobH) [22:43:28] 
(03CR) 10jenkins-bot: [V: 04-1] Enable CirrusSearch writes to enwiki and dewiki as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252121 (owner: 10EBernhardson) [22:44:33] (03Abandoned) 10Dzahn: analytics::mysql::meta, move standard/fw to role [puppet] - 10https://gerrit.wikimedia.org/r/250617 (owner: 10Dzahn) [22:44:45] (03PS2) 10EBernhardson: Enable CirrusSearch writes to enwiki and dewiki as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252121 [22:46:17] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 3 below the confidence bounds [22:49:15] !log rebooting analytics1032: https://phabricator.wikimedia.org/T118175 [22:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:51:58] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 3 below the confidence bounds [22:52:48] PROBLEM - Host analytics1032 is DOWN: PING CRITICAL - Packet loss = 100% [22:57:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 4 below the confidence bounds [22:58:06] RECOVERY - Host analytics1032 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [22:58:47] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:01:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 3 below the confidence bounds [23:05:26] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [23:06:16] PROBLEM - configured eth on analytics1032 is CRITICAL: Connection refused by host [23:06:26] PROBLEM - SSH on analytics1032 is CRITICAL: Connection refused [23:06:26] PROBLEM - Check size of conntrack table on analytics1032 is 
CRITICAL: Connection refused by host [23:06:27] PROBLEM - Disk space on Hadoop worker on analytics1032 is CRITICAL: Connection refused by host [23:06:37] PROBLEM - salt-minion processes on analytics1032 is CRITICAL: Connection refused by host [23:06:47] PROBLEM - DPKG on analytics1032 is CRITICAL: Connection refused by host [23:06:56] PROBLEM - Disk space on analytics1032 is CRITICAL: Connection refused by host [23:06:56] PROBLEM - puppet last run on db2030 is CRITICAL: CRITICAL: puppet fail [23:07:06] PROBLEM - dhclient process on analytics1032 is CRITICAL: Connection refused by host [23:08:07] RECOVERY - configured eth on analytics1032 is OK: OK - interfaces up [23:08:17] RECOVERY - SSH on analytics1032 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [23:08:17] RECOVERY - Check size of conntrack table on analytics1032 is OK: OK: nf_conntrack is 0 % full [23:08:17] RECOVERY - Disk space on Hadoop worker on analytics1032 is OK: DISK OK [23:08:36] RECOVERY - salt-minion processes on analytics1032 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:08:37] RECOVERY - DPKG on analytics1032 is OK: All packages OK [23:08:47] RECOVERY - Disk space on analytics1032 is OK: DISK OK [23:08:56] RECOVERY - dhclient process on analytics1032 is OK: PROCS OK: 0 processes with command name dhclient [23:10:16] RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:11:34] (03CR) 10EBernhardson: [C: 032] Enable CirrusSearch writes to labsearch for all but enwiki and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252120 (owner: 10EBernhardson) [23:12:02] (03Merged) 10jenkins-bot: Enable CirrusSearch writes to labsearch for all but enwiki and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252120 (owner: 10EBernhardson) [23:13:47] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Enable CirrusSearch writes to labsearch 
for all but enwiki and dewiki (duration: 00m 35s) [23:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:25:56] AaronSchulz: hey i was looking at the job queue, and i'm not sure anything is moving from the delayed jobs into the main job queue. but i could be wrong [23:26:11] AaronSchulz: on the 4 job runners i checked all have 'Caught signal (15)' as the last item in jobchron.log [23:26:19] (15 == sigterm == kill) [23:26:35] but they all looked to be running the service [23:27:36] for an example, i've been watching cirrusSearchIncomingLinkCount on enwiki, which only increases. Also cirrusSearchElasticaWrite which is an uncommonly inserted job, it has been at the same # of delayed jobs for the last 30 minutes although the longest delay it will use is 15 minutes [23:27:46] IncomingLinkCount typically uses <1 minute for delay [23:29:50] (03PS1) 10ArielGlenn: start work on cleaning up 'chunks' handling [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/252131 [23:29:52] (03PS1) 10ArielGlenn: dumps: clean up construction of list of possible dump jobs for wiki [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/252132 [23:29:54] (03PS1) 10ArielGlenn: dumps: clean up many comments of methods for dumps jobs [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/252133 [23:29:56] (03PS1) 10ArielGlenn: dumps: clean up docstrings for recompress jobs [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/252134 [23:31:57] ebernhardson: https://grafana-admin.wikimedia.org/dashboard/db/job-queue-rate shows undelaying of jobs (I just added that stat btw) [23:33:51] AaronSchulz: so it is doing things, just not since i started watching :) [23:34:58] RECOVERY - puppet last run on db2030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:37:04] 6operations: install/setup/deploy neodymium as salt-master in eqiad - https://phabricator.wikimedia.org/T118210#1794854 (10RobH) box has issue, it keeps booting into pxe and not 
off boot disk. perhaps jessie installs to ssds while system tries to boot of sata? (i'll further investigate later) [23:46:26] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [23:50:07] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [23:53:33] (03PS1) 10EBernhardson: Revert "Enable CirrusSearch writes to labsearch for all but enwiki and dewiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252136 [23:53:42] (03CR) 10EBernhardson: [C: 032] Revert "Enable CirrusSearch writes to labsearch for all but enwiki and dewiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252136 (owner: 10EBernhardson) [23:54:03] (03Merged) 10jenkins-bot: Revert "Enable CirrusSearch writes to labsearch for all but enwiki and dewiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252136 (owner: 10EBernhardson) [23:54:06] 6operations, 10OTRS: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1794873 (10Reedy) @akosiaris There's a report from @Rjd0060 that the test instance is slow. Is that likely just a result of the "hardware" it's on/resources assigned to it vs the actual production machine? [23:55:03] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Disable cirrus writes to labsearch, the machine cant take the load and some jobs are timing out (duration: 00m 34s) [23:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
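Note on the jobchron.log check at 23:26: the last entry on each job runner was 'Caught signal (15)', and 15 is SIGTERM, the default signal sent by `kill`. A minimal sketch of turning such an entry into a signal name. The quoted text 'Caught signal (15)' is from the channel; the helper name and the assumption that the number always appears in parentheses are hypothetical.

```python
import re
import signal

# Matches the entry quoted in the channel, e.g. "Caught signal (15)".
CAUGHT = re.compile(r"Caught signal \((\d+)\)")


def signal_name(log_line):
    """Return the signal name for a 'Caught signal (N)' entry, or None.

    Raises ValueError if N is not a signal number known on this platform.
    """
    match = CAUGHT.search(log_line)
    if match is None:
        return None
    return signal.Signals(int(match.group(1))).name


if __name__ == "__main__":
    print(signal_name("Caught signal (15)"))
```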