[00:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171129T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:00:33] bblack: yea, it comes from standard unless Hiera tells it not to. can i just add it by hostname for now?
[00:02:51] (PS1) Dzahn: bast4002: disable ganglia [puppet] - https://gerrit.wikimedia.org/r/393956
[00:04:04] (PS2) Dzahn: bast4002: disable ganglia [puppet] - https://gerrit.wikimedia.org/r/393956
[00:04:37] (CR) BBlack: [C: 2] bast4002: disable ganglia [puppet] - https://gerrit.wikimedia.org/r/393956 (owner: Dzahn)
[00:15:13] (PS1) Dzahn: analytics,aqs,kafka: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/393962 (https://phabricator.wikimedia.org/T177225)
[00:16:31] (PS2) Dzahn: analytics,aqs,kafka: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/393962 (https://phabricator.wikimedia.org/T177225)
[00:18:41] !log deleted 6 archived files from servers for legal compliance
[00:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:39] (CR) Dzahn: [C: 2] analytics,aqs,kafka: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/393962 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[00:23:16] (CR) Faidon Liambotis: [C: -1] apt: add class apt::dpkgconfold and include it from apt::unattendedupgrades (2 comments) [puppet] - https://gerrit.wikimedia.org/r/392421 (https://phabricator.wikimedia.org/T180811) (owner: Arturo Borrero Gonzalez)
[00:23:45] jouncebot: now
[00:23:46] For the next 0 hour(s) and 36 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171129T0000)
[00:32:18] !log awight@tin Synchronized php-1.31.0-wmf.8/extensions/ORES: Hotfix to mitigate cache stampeding, T181567 (duration: 00m 50s)
[00:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:25] T181567: Rate limit thresholds requests when the service is down - https://phabricator.wikimedia.org/T181567
[00:33:16] !log reedy@tin Synchronized php-1.31.0-wmf.10/includes/logging/LogPager.php: Fix fatal on Special:Log T181565 (duration: 00m 48s)
[00:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:23] T181565: Call to a member function bitAnd() on a non-object (null) in LogPager, so Special:Log doesn't work - https://phabricator.wikimedia.org/T181565
[00:39:23] wait, icinga-wm you didn't tell me about failed puppet..
[00:40:51] damn, issues on a bunch of hosts but i didnt notice because i relied on the bot
[00:42:05] mutante: thats odd
[00:43:07] yea, but now i want to fix the actual issue before the bot :p
[00:44:18] !log awight@tin Synchronized php-1.31.0-wmf.10/extensions/ORES: Hotfix to mitigate cache stampeding, T181567 (duration: 00m 49s)
[00:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:27] T181567: Rate limit thresholds requests when the service is down - https://phabricator.wikimedia.org/T181567
[00:45:15] Operations, Traffic: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#3795140 (faidon)
[00:47:23] (PS1) Dzahn: phabricator: remove ganglia include [puppet] - https://gerrit.wikimedia.org/r/393967 (https://phabricator.wikimedia.org/T177225)
[00:47:42] (PS2) Dzahn: phabricator: remove ganglia include [puppet] - https://gerrit.wikimedia.org/r/393967 (https://phabricator.wikimedia.org/T177225)
[00:48:15] (CR) Dzahn: [C: 2] phabricator: remove ganglia include [puppet] - https://gerrit.wikimedia.org/r/393967 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[01:01:57] (PS1) Dzahn: ganglia: remove service{} from decom class [puppet] - https://gerrit.wikimedia.org/r/393970 (https://phabricator.wikimedia.org/T177225)
[01:03:24] (PS2) Dzahn: ganglia: remove service{} from decom class [puppet] - https://gerrit.wikimedia.org/r/393970 (https://phabricator.wikimedia.org/T177225)
[01:04:01] (CR) Dzahn: [C: 2] ganglia: remove service{} from decom class [puppet] - https://gerrit.wikimedia.org/r/393970 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[01:07:44] !log forcing puppet run on all labvirt* machines to clean out Icinga alerts
[01:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:09:05] !log restarting ircecho - it stopped talking
[01:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:09:42] RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:09:42] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:10:12] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:10:12] RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:10:23] RECOVERY - puppet last run on labtestnet2001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[01:10:23] RECOVERY - puppet last run on labtestneutron2001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[01:10:24] ^ well that would have popped up earlier, normally
[01:10:33] the bot needed a kick
[01:10:42] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:10:42] RECOVERY - puppet last run on labcontrol1003 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[01:10:42] RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:10:43] RECOVERY - puppet last run on labvirt1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:11:02] RECOVERY - puppet last run on labvirt1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:11:02] RECOVERY - puppet last run on labtestmetal2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:11:03] RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:11:03] RECOVERY - puppet last run on labtestnet2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:11:42] RECOVERY - puppet last run on labvirt1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:11:42] RECOVERY - puppet last run on labtestservices2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:11:43] RECOVERY - puppet last run on labtestvirt2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:11:52] RECOVERY - puppet last run on labvirt1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:11:53] RECOVERY - puppet last run on labvirt1008 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:12:02] RECOVERY - puppet last run on labvirt1018 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[01:12:04] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:12:23] RECOVERY - puppet last run on labvirt1015 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[01:12:52] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:13:08] Operations, ops-eqiad: Lost network connectivity on mw1276 - https://phabricator.wikimedia.org/T181397#3788499 (Dzahn) When there are hardware issues, let's close the ticket only after servers actually get repooled. Because otherwise we keep forgetting that. Noticed it from Icinga saying: "Host mw1276 i...
[01:13:22] RECOVERY - puppet last run on labtestservices2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:13:32] RECOVERY - puppet last run on labtestneutron2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:14:43] RECOVERY - puppet last run on labtestvirt2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[01:14:43] RECOVERY - puppet last run on labtestvirt2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[01:16:12] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:16:52] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[01:17:02] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[01:17:12] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[01:17:13] RECOVERY - puppet last run on labnet1001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[01:18:02] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:19:53] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:20:12] RECOVERY - puppet last run on labcontrol1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[01:21:29] !log mw1276 - run "scap pull" to get in sync after hardware issue, then pooled again (T181397)
[01:21:33] (PS10) Smalyshev: Enable configuration for aliasing namespaces [puppet] - https://gerrit.wikimedia.org/r/392554 (https://phabricator.wikimedia.org/T181016)
[01:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:37] T181397: Lost network connectivity on mw1276 - https://phabricator.wikimedia.org/T181397
[01:22:03] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:22:48] !log analytics1003 - closed idle screen session
[01:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:38] !log snapshot1001 - closed idle screen session
[01:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:32:46] (PS1) Dzahn: analytics/hadoop: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/393976 (https://phabricator.wikimedia.org/T177225)
[01:33:12] (CR) Dzahn: [C: 2] "replacement dashboard is at https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1" [puppet] - https://gerrit.wikimedia.org/r/393976 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[01:33:20] (PS2) Dzahn: analytics/hadoop: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/393976 (https://phabricator.wikimedia.org/T177225)
[01:33:27] (CR) Dzahn: [V: 2 C: 2] analytics/hadoop: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/393976 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[01:45:04] (PS3) Jforrester: Enable TimedMediaHandler's new video player Beta Feature [mediawiki-config] - https://gerrit.wikimedia.org/r/354390 (https://phabricator.wikimedia.org/T148103)
[01:45:19] (CR) Jforrester: [C: -2] "Not yet." [mediawiki-config] - https://gerrit.wikimedia.org/r/354390 (https://phabricator.wikimedia.org/T148103) (owner: Jforrester)
[01:49:32] RECOVERY - mediawiki-installation DSH group on mw1276 is OK: OK
[02:24:23] PROBLEM - Nginx local proxy to apache on mw2108 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:24:44] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.8) (duration: 06m 08s)
[02:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:13] RECOVERY - Nginx local proxy to apache on mw2108 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.201 second response time
[02:26:56] (PS1) Dzahn: analytics misc,piwik,dataset,servermon: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/393979 (https://phabricator.wikimedia.org/T177225)
[02:28:26] (PS2) Dzahn: analytics misc,piwik,dataset,servermon: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/393979 (https://phabricator.wikimedia.org/T177225)
[02:30:09] (CR) Dzahn: [C: 2] analytics misc,piwik,dataset,servermon: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/393979 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[02:31:11] (PS1) Dzahn: gerrit: restore system::role for motd [puppet] - https://gerrit.wikimedia.org/r/393980
[02:32:01] (PS2) Dzahn: gerrit: restore system::role for motd [puppet] - https://gerrit.wikimedia.org/r/393980
[02:32:48] (CR) Dzahn: [C: 2] gerrit: restore system::role for motd [puppet] - https://gerrit.wikimedia.org/r/393980 (owner: Dzahn)
[02:39:38] (PS1) Dzahn: elasticsearch,elastic-relforge: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/393981 (https://phabricator.wikimedia.org/T177225)
[02:40:14] (PS2) Dzahn: elasticsearch,elastic-relforge: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/393981 (https://phabricator.wikimedia.org/T177225)
[02:40:19] (CR) Dzahn: [C: 2] elasticsearch,elastic-relforge: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/393981 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[02:40:41] (CR) Dzahn: [C: 2] "already confirmed in the past that Elasticsearch has all the things they need in grafana" [puppet] - https://gerrit.wikimedia.org/r/393981 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[02:57:11] (PS1) Dzahn: planet: add feeds of some Outreachy 17 participants [puppet] - https://gerrit.wikimedia.org/r/393985 (https://phabricator.wikimedia.org/T181587)
[02:58:18] (PS2) Dzahn: planet: add feeds of some Outreachy round 15 participants [puppet] - https://gerrit.wikimedia.org/r/393985 (https://phabricator.wikimedia.org/T181587)
[02:58:27] (PS3) Dzahn: planet: add feeds of some Outreachy round 15 participants [puppet] - https://gerrit.wikimedia.org/r/393985 (https://phabricator.wikimedia.org/T181587)
[02:59:31] (CR) Dzahn: [C: 2] planet: add feeds of some Outreachy round 15 participants [puppet] - https://gerrit.wikimedia.org/r/393985 (https://phabricator.wikimedia.org/T181587) (owner: Dzahn)
[03:22:22] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v1/dictionary/{word}/{from}/{to}{/provider} (Fetch dictionay meaning with a given provide
[03:22:23] a response was received: /v1/dictionary/{word}/{from}/{to}{/provider} (Fetch dictionay meaning without specifying a provider) timed out before a response was received: / (root with no query params) timed out before a response was received: / (spec from root) timed out before a response was received: / (root with wrong query param) timed out before a response was received: /v1/list/{tool}{/from}{/to} (Get the MT tool between tw
[03:22:23] imed out before a response was received: /_info/home (redirect to the home page) timed out before a response was received: /_info (retrieve service info) timed out before a response w
[03:23:22] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy
[03:23:43] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received
[03:24:23] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 675.43 seconds
[03:24:42] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy
[03:30:43] PROBLEM - Long running screen/tmux on labstore1006 is CRITICAL: CRIT: Long running SCREEN process. (PID: 29022, 1754065s 1728000s).
[03:44:02] RECOVERY - Long running screen/tmux on analytics1003 is OK: OK: No SCREEN or tmux processes detected.
[03:44:12] RECOVERY - Long running screen/tmux on snapshot1001 is OK: OK: No SCREEN or tmux processes detected.
[03:50:55] Operations, Ops-Access-Requests, Discovery, Wikidata, and 2 others: Enable wdqs-admin's to control nginx - https://phabricator.wikimedia.org/T181540#3795395 (Dzahn) a:Dzahn
[03:50:58] Operations, Ops-Access-Requests: Requesting access to terbium/wasat for Trey Jones - https://phabricator.wikimedia.org/T181479#3795396 (Dzahn) a:Dzahn
[03:59:23] PROBLEM - Nginx local proxy to apache on mw2132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:00:13] RECOVERY - Nginx local proxy to apache on mw2132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.197 second response time
[04:00:25] Operations, Ops-Access-Requests: Requesting access to terbium/wasat for Trey Jones - https://phabricator.wikimedia.org/T181479#3795402 (Dzahn) Hi @TJones I will handle this access request. We might have to create a new admin group for this type of access, i will look at that. Meanwhile you could start b...
[04:02:02] (CR) Dzahn: "I think we should encourage using systemctl nowadays instead of the service command. As in the lines for wdqs-updater and:" [puppet] - https://gerrit.wikimedia.org/r/393814 (https://phabricator.wikimedia.org/T181540) (owner: Smalyshev)
[04:08:52] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 231.27 seconds
[04:12:30] Operations, Ops-Access-Requests: Requesting access to terbium/wasat for Trey Jones - https://phabricator.wikimedia.org/T181479#3795404 (Dzahn) We can probably use the existing group called "restricted". ``` restricted: gid: 706 description: access to terbium, mwlog hosts (private data) and ba...
[04:14:41] Operations, Ops-Access-Requests: Requesting access to terbium/wasat for Trey Jones - https://phabricator.wikimedia.org/T181479#3795406 (Dzahn) @Tjones please ignore my comments, i just realized you already have previous shell accesss (mw-log-readers, analytics-privatedata-uers). So all i said about SSH k...
[04:18:35] (PS1) Dzahn: admins: add tjones to group 'restricted' [puppet] - https://gerrit.wikimedia.org/r/393988 (https://phabricator.wikimedia.org/T181479)
[04:53:32] Checking in. MW is not hammering ORES for "test_stats"
[04:54:08] We did not deploy new code. It all just seems to be "fine" now :|
[05:34:08] (PS1) KartikMistry: apertium-crh-tur: New upstream release [debs/contenttranslation/apertium-crh-tur] - https://gerrit.wikimedia.org/r/393993 (https://phabricator.wikimedia.org/T181465)
[05:34:19] (CR) jerkins-bot: [V: -1] apertium-crh-tur: New upstream release [debs/contenttranslation/apertium-crh-tur] - https://gerrit.wikimedia.org/r/393993 (https://phabricator.wikimedia.org/T181465) (owner: KartikMistry)
[05:37:11] (PS1) Dzahn: mwlog/xenon: access should be based on role, not host names [puppet] - https://gerrit.wikimedia.org/r/393994
[06:12:33] !log ebernhardson@tin Started deploy [search/mjolnir/deploy@7aa39b7]: (no justification provided)
[06:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:00] !log ebernhardson@tin Finished deploy [search/mjolnir/deploy@7aa39b7]: (no justification provided) (duration: 04m 27s)
[06:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:09] (PS1) Marostegui: s1.hosts: Add db1099:3311 [software] - https://gerrit.wikimedia.org/r/393998 (https://phabricator.wikimedia.org/T178359)
[06:21:08] Operations, ops-codfw, DBA: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T181266#3795470 (Marostegui) Open>Resolved Thanks @Papaul! ``` root@db2055:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337C9270) Port Name: 1I Por...
[06:22:42] (CR) Marostegui: [C: 2] s1.hosts: Add db1099:3311 [software] - https://gerrit.wikimedia.org/r/393998 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:23:24] (Merged) jenkins-bot: s1.hosts: Add db1099:3311 [software] - https://gerrit.wikimedia.org/r/393998 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:25:06] (PS1) Marostegui: db-eqiad.php: Repool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/393999 (https://phabricator.wikimedia.org/T178359)
[06:27:02] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/393999 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:28:02] PROBLEM - puppet last run on db2090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:28:18] (Merged) jenkins-bot: db-eqiad.php: Repool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/393999 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:28:28] (CR) jenkins-bot: db-eqiad.php: Repool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/393999 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:29:30] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1055 with lower weight - T178359 (duration: 00m 50s)
[06:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:37] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[06:32:13] (PS1) Marostegui: db-eqiad,db-codfw.php: Add db1099:3311 and 3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/394000 (https://phabricator.wikimedia.org/T178359)
[06:35:51] (PS1) EBernhardson: Revert "Revert "Deploy MjoLniR with new deploy repository"" [puppet] - https://gerrit.wikimedia.org/r/394002
[06:36:16] (CR) jerkins-bot: [V: -1] Revert "Revert "Deploy MjoLniR with new deploy repository"" [puppet] - https://gerrit.wikimedia.org/r/394002 (owner: EBernhardson)
[06:39:28] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Add db1099:3311 and 3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/394000 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:40:45] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Add db1099:3311 and 3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/394000 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:40:59] (CR) jenkins-bot: db-eqiad,db-codfw.php: Add db1099:3311 and 3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/394000 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:42:01] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add db1099:3311 and db1099:3318 to the config (depooled) T178359 (duration: 00m 49s)
[06:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:08] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[06:42:32] (CR) EBernhardson: Revert "Revert "Deploy MjoLniR with new deploy repository"" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/394002 (owner: EBernhardson)
[06:42:56] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add db1099:3311 and db1099:3318 to the config (depooled) T178359 (duration: 00m 48s)
[06:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:32] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/394004
[06:47:31] (PS1) Marostegui: mariadb: Nnotifications for db1099,reimage db1096 [puppet] - https://gerrit.wikimedia.org/r/394007 (https://phabricator.wikimedia.org/T178359)
[06:48:15] (PS2) Marostegui: mariadb: Nnotifications for db1099,reimage db1096 [puppet] - https://gerrit.wikimedia.org/r/394007 (https://phabricator.wikimedia.org/T178359)
[06:48:36] (CR) Marostegui: [C: 2] db-eqiad.php: Increase traffic for db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/394004 (owner: Marostegui)
[06:48:41] (PS3) Marostegui: mariadb: Notifications for db1099,reimage db1096 [puppet] - https://gerrit.wikimedia.org/r/394007 (https://phabricator.wikimedia.org/T178359)
[06:50:02] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/394004 (owner: Marostegui)
[06:50:16] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/394004 (owner: Marostegui)
[06:51:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1055 - T178359 (duration: 00m 48s)
[06:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:21] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[06:55:23] (PS1) Marostegui: db-eqiad.php: Pool db1099 with low traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/394008 (https://phabricator.wikimedia.org/T178359)
[06:57:15] (CR) Marostegui: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/9032/" [puppet] - https://gerrit.wikimedia.org/r/394007 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:58:02] RECOVERY - puppet last run on db2090 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:14:22] (CR) Marostegui: [C: 2] db-eqiad.php: Pool db1099 with low traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/394008 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[07:15:52] (Merged) jenkins-bot: db-eqiad.php: Pool db1099 with low traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/394008 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[07:17:07] (CR) jenkins-bot: db-eqiad.php: Pool db1099 with low traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/394008 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[07:17:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Pool db1099:3318 with low weight - T178359 (duration: 00m 45s)
[07:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:49] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[07:26:25] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1099:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/394010
[07:31:06] (CR) Marostegui: [C: 2] db-eqiad.php: Increase traffic for db1099:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/394010 (owner: Marostegui)
[07:32:25] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1099:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/394010 (owner: Marostegui)
[07:32:38] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1099:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/394010 (owner: Marostegui)
[07:33:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1099:3318 traffic - T178359 (duration: 00m 45s)
[07:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:51] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[07:36:49] Operations, puppet-compiler: puppet compiler fail compilation on manifests using puppetdb - https://phabricator.wikimedia.org/T180671#3795595 (Joe) Open>Resolved a:Joe
[07:36:59] (Abandoned) Giuseppe Lavagetto: Revert "jobrunner: drop number of "basic" jobs in favour of html ones" [puppet] - https://gerrit.wikimedia.org/r/393779 (owner: Giuseppe Lavagetto)
[07:39:27] (PS1) Marostegui: db-eqiad.php: Fully pool db1099:3318 and db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/394012
[07:51:47] (CR) Marostegui: [C: 2] db-eqiad.php: Fully pool db1099:3318 and db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/394012 (owner: Marostegui)
[07:53:12] (Merged) jenkins-bot: db-eqiad.php: Fully pool db1099:3318 and db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/394012 (owner: Marostegui)
[07:53:26] (CR) jenkins-bot: db-eqiad.php: Fully pool db1099:3318 and db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/394012 (owner: Marostegui)
[07:54:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully pool db1099:3318 and db1055 - T178359 (duration: 00m 48s)
[07:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:31] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[08:02:42] Operations, Scoring-platform-team, Patch-For-Review, Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3795673 (akosiaris)
[08:02:45] Operations, Scoring-platform-team (Current), Wikimedia-Incident: Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200* - https://phabricator.wikimedia.org/T181563#3795671 (akosiaris) Resolved>Open Re-opening per the following: After a brief discussion in #wikimedia-ai at ~01...
[08:08:12] (03PS1) 10Marostegui: db-eqiad.php: Depool db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394014 (https://phabricator.wikimedia.org/T178359) [08:15:31] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394014 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [08:16:49] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394014 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [08:17:01] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394014 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [08:17:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1096 - T178359 (duration: 00m 48s) [08:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:59] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [08:19:15] (03CR) 10Muehlenhoff: [V: 032 C: 032] Create prometheus user and switch systemd unit to it [debs/prometheus-openldap-exporter] - 10https://gerrit.wikimedia.org/r/393811 (owner: 10Muehlenhoff) [08:21:06] !log Stop MySQL on db1096 to transfer its content to dbstore1001 to reimage it later - T178359 [08:21:12] (03PS1) 10Muehlenhoff: Use dh-systemd [debs/prometheus-openldap-exporter] - 10https://gerrit.wikimedia.org/r/394015 [08:26:43] (03CR) 10Muehlenhoff: [V: 032 C: 032] Use dh-systemd [debs/prometheus-openldap-exporter] - 10https://gerrit.wikimedia.org/r/394015 (owner: 10Muehlenhoff) [08:28:42] (03PS1) 10Marostegui: mariadb: Convert db1096 to multi-instance: s5,s6 [puppet] - 10https://gerrit.wikimedia.org/r/394016 (https://phabricator.wikimedia.org/T178359) [08:30:15] (03PS1) 10Muehlenhoff: Add a .gitreview file [debs/prometheus-openldap-exporter] - 10https://gerrit.wikimedia.org/r/394018 [08:30:41] (03CR) 10Muehlenhoff: [V: 032 C: 
032] Add a .gitreview file [debs/prometheus-openldap-exporter] - 10https://gerrit.wikimedia.org/r/394018 (owner: 10Muehlenhoff) [08:35:29] (03PS1) 10Muehlenhoff: Fix distribution [debs/prometheus-openldap-exporter] - 10https://gerrit.wikimedia.org/r/394019 [08:36:20] (03CR) 10Muehlenhoff: [V: 032 C: 032] Fix distribution [debs/prometheus-openldap-exporter] - 10https://gerrit.wikimedia.org/r/394019 (owner: 10Muehlenhoff) [08:38:30] (03PS6) 10Elukey: Rename ::profile::hadoop::client to commons and move some features out [puppet] - 10https://gerrit.wikimedia.org/r/393738 (https://phabricator.wikimedia.org/T167790) [08:41:25] !log uploaded prometheus-openldap-exporter 0+git20171128-1 for jessie-wikimedia (T181511) [08:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:33] T181511: Package openldap collector for Prometheus and adapt metrics - https://phabricator.wikimedia.org/T181511 [08:43:46] (03PS11) 10Gehel: Enable configuration for aliasing namespaces [puppet] - 10https://gerrit.wikimedia.org/r/392554 (https://phabricator.wikimedia.org/T181016) (owner: 10Smalyshev) [08:44:25] (03CR) 10Gehel: [C: 032] Enable configuration for aliasing namespaces [puppet] - 10https://gerrit.wikimedia.org/r/392554 (https://phabricator.wikimedia.org/T181016) (owner: 10Smalyshev) [08:45:38] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1110 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393795 [08:49:26] (03CR) 10Elukey: [C: 032] "No op from https://puppet-compiler.wmflabs.org/compiler02/9033/" [puppet] - 10https://gerrit.wikimedia.org/r/393738 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [08:49:33] (03PS7) 10Elukey: Rename ::profile::hadoop::client to commons and move some features out [puppet] - 10https://gerrit.wikimedia.org/r/393738 (https://phabricator.wikimedia.org/T167790) [08:50:51] PROBLEM - nova-compute process on labvirt1010 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/pytho[n] 
/usr/bin/nova-compute [08:51:34] (03CR) 10Gehel: [C: 04-1] Create script for automatic reload of categories (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/392736 (https://phabricator.wikimedia.org/T173772) (owner: 10Smalyshev) [08:51:44] RECOVERY - nova-compute process on labvirt1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [08:53:38] (03PS4) 10Smalyshev: Create script for automatic reload of categories [puppet] - 10https://gerrit.wikimedia.org/r/392736 (https://phabricator.wikimedia.org/T173772) [08:55:03] (03CR) 10Jdrewniak: [C: 031] Remove www.*.org symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391355 (owner: 10Chad) [08:57:45] (03PS2) 10Filippo Giunchedi: Revert "hieradata: disable restbase1007-c cassandra instance" [puppet] - 10https://gerrit.wikimedia.org/r/393810 (https://phabricator.wikimedia.org/T179422) [08:59:33] (03CR) 10Filippo Giunchedi: [C: 032] Revert "hieradata: disable restbase1007-c cassandra instance" [puppet] - 10https://gerrit.wikimedia.org/r/393810 (https://phabricator.wikimedia.org/T179422) (owner: 10Filippo Giunchedi) [09:00:31] (03CR) 10Marostegui: [C: 031] Revert "mariadb: Depool db1110 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393795 (owner: 10Jcrespo) [09:03:53] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1110 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393795 (owner: 10Jcrespo) [09:04:39] (03PS2) 10Marostegui: mariadb: Convert db1096 to multi-instance: s5,s6 [puppet] - 10https://gerrit.wikimedia.org/r/394016 (https://phabricator.wikimedia.org/T178359) [09:04:43] !log bootstrap restbase1007-c - T179422 [09:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:51] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [09:05:20] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1110 for maintenance" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/393795 (owner: 10Jcrespo) [09:06:58] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1110 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393795 (owner: 10Jcrespo) [09:07:54] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [09:09:14] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [09:09:37] mhh looks like already recovered? according to logstash [09:09:51] mostly cp3033 ints afaics [09:10:02] but it seems indeed one single spike [09:10:33] (03PS4) 10Jcrespo: mariadb: Setup s8 empty on eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393808 (https://phabricator.wikimedia.org/T177208) [09:11:27] (03PS1) 10Gehel: wdqs: schedule cronjob to reload categories [puppet] - 10https://gerrit.wikimedia.org/r/394021 (https://phabricator.wikimedia.org/T173772) [09:11:46] (03CR) 10jerkins-bot: [V: 04-1] wdqs: schedule cronjob to reload categories [puppet] - 10https://gerrit.wikimedia.org/r/394021 (https://phabricator.wikimedia.org/T173772) (owner: 10Gehel) [09:11:46] I will wait for it to recover before deploying [09:12:01] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=now-3h&to=now [09:12:13] hi, who would know about indexation of Wikimedia user pages, and user talk pages? [09:12:43] Someone has asked me about this, and I don't know what to say [09:13:10] i.e. 
https://fr.wikisource.org/w/index.php?title=Discussion_utilisateur:Nomen_ad_hoc&diff=7030530&oldid=6132072 [09:13:57] it seems the __NOINDEX__ tag has no effect [09:14:38] where should I ask? [09:14:45] or redirect this user to? [09:15:35] (03CR) 10Filippo Giunchedi: bast4002: switch over prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393943 (owner: 10BBlack) [09:15:58] yannf: for questions there is #wikimedia-tech, I think [09:16:20] ye [09:16:32] (03PS7) 10Elukey: profile::hadoop::common,profile::hive::client: move hiera config in one place [puppet] - 10https://gerrit.wikimedia.org/r/393741 (https://phabricator.wikimedia.org/T167790) [09:16:49] I would try there first, and if they tell you there is a problem, a ticker can be filed on phabricator ( https://phabricator.wikimedia.org ) [09:16:57] *ticket [09:17:06] (03PS1) 10Alexandros Kosiaris: Disable ORES redis persistence for queue [puppet] - 10https://gerrit.wikimedia.org/r/394022 (https://phabricator.wikimedia.org/T181563) [09:17:36] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [09:18:25] yannf: are you sure you want __NOINDEX__ and not __NOTOC__ ? [09:19:32] now upload ? [09:19:59] there is another spike [09:20:06] yeah, looking [09:20:45] https://logstash.wikimedia.org/goto/c24e7c257d641e73034cfe60eb534ca0 [09:23:19] do you see any pattern? [09:23:31] cp3033 seems the most affected one afaics [09:23:38] y [09:24:30] but it might be due to a specific link that gets hashed to it? 
[09:24:30] cp3033 is a text host btw [09:24:43] !log mobrovac@tin Started deploy [electron-render/deploy@94d27d7]: Update to electron v1.7.9 and start using the Charter font - T181200 [09:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:50] T181200: Use "Charter" as preferred typeface on Electron - https://phabricator.wikimedia.org/T181200 [09:25:12] huh, chrome's login form doesn't show the text anymore? [09:27:16] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [09:28:22] jynus, yes, __NOINDEX__, this user doesn't want the talk page to be indexed by Google [09:28:45] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [09:28:46] !log mobrovac@tin Finished deploy [electron-render/deploy@94d27d7]: Update to electron v1.7.9 and start using the Charter font - T181200 (duration: 04m 03s) [09:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:06] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [09:30:24] yannf: that page has "" in the HTML source, so the __NOINDEX__ is working. 
If it's still in Google, that's a problem on their end [09:30:44] legoktm, ok, thanks [09:36:46] (03CR) 10Elukey: [C: 032] "No op from https://puppet-compiler.wmflabs.org/compiler03/9036/" [puppet] - 10https://gerrit.wikimedia.org/r/393741 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [09:39:40] !log upload cassandra-tools-wmf 1.0.2-1 - T181438 [09:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:47] T181438: Upload new cassandra-tools-wmf package to Debian repository - https://phabricator.wikimedia.org/T181438 [09:39:52] 10Operations, 10Cassandra, 10User-Eevans: Upload new cassandra-tools-wmf package to Debian repository - https://phabricator.wikimedia.org/T181438#3795842 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi [09:40:31] (03CR) 10Marostegui: [C: 032] mariadb: Convert db1096 to multi-instance: s5,s6 [puppet] - 10https://gerrit.wikimedia.org/r/394016 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [09:40:37] (03PS3) 10Marostegui: mariadb: Convert db1096 to multi-instance: s5,s6 [puppet] - 10https://gerrit.wikimedia.org/r/394016 (https://phabricator.wikimedia.org/T178359) [09:41:34] (03CR) 10Filippo Giunchedi: "PCC-happy https://puppet-compiler.wmflabs.org/compiler03/9035/" [puppet] - 10https://gerrit.wikimedia.org/r/393794 (https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [09:46:30] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1110 (duration: 00m 49s) [09:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:04] we have a large number of connections failing on s5 [09:48:37] it is db1110 [09:48:39] I have been monitoring logtash and I don't see a huge increase of errors [09:48:42] Ah [09:48:54] Failing with what? 
[09:49:09] (03PS1) 10Jcrespo: Revert "Revert "mariadb: Depool db1110 for maintenance"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394024 [09:49:17] (03CR) 10Jcrespo: [V: 032 C: 032] Revert "Revert "mariadb: Depool db1110 for maintenance"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394024 (owner: 10Jcrespo) [09:49:29] (03CR) 10jenkins-bot: Revert "Revert "mariadb: Depool db1110 for maintenance"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394024 (owner: 10Jcrespo) [09:49:58] (03PS1) 10Muehlenhoff: Add Prometheus exporter to openldap/labs [puppet] - 10https://gerrit.wikimedia.org/r/394025 [09:50:00] https://logstash.wikimedia.org/goto/e303956866cd2bce403481f04327b8bb [09:50:20] (03CR) 10jerkins-bot: [V: 04-1] Add Prometheus exporter to openldap/labs [puppet] - 10https://gerrit.wikimedia.org/r/394025 (owner: 10Muehlenhoff) [09:50:25] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1110 (duration: 00m 48s) [09:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:40] Ah, I was using another channel that is why I didn't see it [09:51:13] Can't connect to MySQL server on '10.64.32.31' [09:51:58] Yeah, it took me AGES to connect to its IP [09:52:25] <_joe_> again? [09:52:27] there is still ip-level issues [09:52:30] <_joe_> uhm [09:52:33] this is horrible [09:52:36] <_joe_> jynus: how do you know? 
[09:52:59] _joe_: see my above link [09:53:06] unless we can find another explanation [09:53:23] mysql is up on that ip [09:53:33] yeah, it is [09:53:34] but the error is tcp or below [09:53:42] not app-level [09:53:42] but it took me ages from neodymium to connect to the IP [09:53:53] <_joe_> marostegui: define connect to the ip please [09:53:58] sorry [09:54:01] <_joe_> :) [09:54:03] mysql --skip-ssl -h10.64.32.31 [09:54:04] XD [09:54:32] <_joe_> the pings are ok [09:55:07] <_joe_> using telnet, from terbium, I can see the connection instantly [09:55:08] the dig reverse resolution from all eqiad is correct [09:55:14] <_joe_> telnet 10.64.32.31 3306 [09:55:25] <_joe_> so it's a mysql issue from all I can see [09:55:35] look at the error [09:55:48] <_joe_> actually now even mysql is fast [09:55:57] <_joe_> are errors still ongoing? [09:56:02] no, jaime depooled it [09:56:04] I have depooled it [09:56:27] <_joe_> anything suspicious in the server logs? [09:56:51] define "server logs" please [09:57:04] <_joe_> syslog dmesg etc [09:57:11] <_joe_> server as opposed to database [09:57:59] <_joe_> I'm trying to understand where the errors came from mostly, and investigate from there [09:58:28] I can connect fine from one of thehosts that failed, to db1110 [09:58:33] mw1198 [09:58:46] <_joe_> because errors seem to come from a small subset of servers [09:58:53] <_joe_> appservers I mean [09:58:56] nothing [10:00:00] could there be some app-level cache for dns or ip? [10:00:18] but marostegui saw slowness from neodymium, too [10:00:57] <_joe_> ok most of the servers reporting errors were in rack c6 [10:01:23] I have checked and app load wasn't enough to justify slowness [10:01:29] app here == mysql [10:01:47] <_joe_> can someone look at librenms for anything suspicious? 
I'm looking at something else [10:01:51] (03PS2) 10Muehlenhoff: Add Prometheus exporter to openldap/labs [puppet] - 10https://gerrit.wikimedia.org/r/394025 [10:01:54] top mw* hosts with errors seems to be in the same rack to me, but I didn't check it thourogly yet [10:02:07] <_joe_> volans: c6, just said it [10:02:21] <_joe_> the db is on c3 [10:02:23] <_joe_> so same row [10:02:33] <_joe_> and those are basically all appservers in row C [10:02:38] in fact, almost no connection went through [10:02:57] <_joe_> so interestingly, c6 is the rack where mw1326 is located [10:03:05] <_joe_> did someone turn it on with the wrong IP? [10:03:32] it was reimaged yesterday, and bblack checked it had the right ip [10:03:37] I don't know after that [10:03:42] mw1326? [10:03:49] uptime reports 14 days eh [10:04:05] <_joe_> wasn't it the server with the wrong duplicate IP? [10:04:12] <_joe_> or do I remember the number wrong? [10:04:16] mw1329 [10:04:17] https://gerrit.wikimedia.org/r/#/c/393787/1/templates/10.in-addr.arpa [10:04:17] mw1329 [10:04:28] yep [10:04:39] arp cache on mw1200 has a different mac [10:04:39] and it seems down afaics, I tried to ssh to it [10:05:05] <_joe_> volans: [10:05:13] I just checked its ILO [10:05:13] <_joe_> how did you pick mw1200 as well? 
[10:05:20] It is on the initramfs [10:05:23] so it is down yes [10:05:34] <_joe_> no it's not down [10:05:39] <_joe_> power off the damn thing [10:05:39] db1110 is 80:18:44:df:d2:00 [10:05:49] _joe_: but it is on the initramfs [10:06:11] <_joe_> ok so let me test one thing [10:06:14] talking about mw1329 [10:06:18] <_joe_> volans: don't touch mw1200 [10:06:20] <_joe_> please [10:06:27] ok [10:06:35] ok, one at a time [10:06:54] _joe_: you lead [10:07:14] tell us to do stuff if you need it [10:07:22] <_joe_> ok confirmed: mw1200 had still the wrong arp [10:07:29] _joe_: see above [10:07:33] <_joe_> but cleaning it by hand got the good one [10:07:47] <_joe_> by pinging the ip again [10:08:00] <_joe_> jynus: did you just repool db1110 this morning? [10:08:06] right now [10:08:12] when the errors started [10:08:15] <_joe_> ok [10:08:20] <_joe_> so the hosts in c6 [10:08:22] then depool when I saw the errors [10:08:27] <_joe_> are in the same rack as mw1329 [10:08:38] (03PS3) 10Muehlenhoff: Add Prometheus exporter to openldap/labs [puppet] - 10https://gerrit.wikimedia.org/r/394025 [10:08:41] <_joe_> so they had registered the arp for that ip yesterday [10:08:43] <_joe_> in their cache [10:08:47] yes [10:08:50] <_joe_> and we didn't flush those [10:08:52] do you think it is a host-created error, or a router [10:09:03] <_joe_> jynus: us-created :P [10:09:06] pure host cache? [10:09:12] <_joe_> jynus: as in we didn't properly cleanup yesterday [10:09:15] but correct everthere else? [10:09:16] <_joe_> lemme fix this [10:09:21] <_joe_> jynus: yes [10:09:26] ok, good [10:09:38] <_joe_> I do think we cleaned the switches cache, if they have one [10:09:52] <_joe_> you know for me network gear is a series of black monoliths :P [10:09:53] but I thought arp happens all the time? 
[10:10:05] but maybe I do not understand that well our network [10:10:06] <_joe_> jynus: only when a value gets invalidated [10:10:27] <_joe_> I don't know how long is the arp cache tbh [10:10:28] (03PS2) 10Alexandros Kosiaris: Disable ORES redis persistence for queue [puppet] - 10https://gerrit.wikimedia.org/r/394022 (https://phabricator.wikimedia.org/T181563) [10:10:42] and the host has been up all this time [10:10:47] just not pooled on mediawiki [10:11:05] <_joe_> so. I can check and clean the arp caches for this across the appservers pool [10:11:07] should I restart the db, just to be sure? [10:11:09] <_joe_> lemme do a check [10:11:12] <_joe_> jynus: nope [10:11:15] ok [10:11:15] _joe_: just ping [10:11:22] no need to flush [10:11:23] <_joe_> volans: it's not enough [10:11:33] <_joe_> at least wasn't on mw1200 [10:11:37] we can do a ping anyway afterwards [10:11:37] <_joe_> lemme try elsewhere [10:11:41] to check? [10:11:57] you a have a list of servers with connection problems on the link I sent of logs [10:12:15] <_joe_> jynus: it's simply all the hosts in c6 [10:12:24] <_joe_> so sharing the same rack as mw1329 [10:12:32] I see they are consecutive [10:12:35] so it makes sense [10:13:09] what a headacke traffic/network is [10:13:24] PROBLEM - Restbase root url on restbase1012 is CRITICAL: connect to address 10.64.32.79 and port 7231: Connection refused [10:14:24] (03CR) 10Alexandros Kosiaris: [C: 032] Disable ORES redis persistence for queue [puppet] - 10https://gerrit.wikimedia.org/r/394022 (https://phabricator.wikimedia.org/T181563) (owner: 10Alexandros Kosiaris) [10:15:07] _joe_: already fixed on all? I was testing something :( [10:15:22] !log disable puppet on oresrdb* for merging https://gerrit.wikimedia.org/r/#/c/394022/. 
T181563 [10:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:30] T181563: Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200* - https://phabricator.wikimedia.org/T181563 [10:16:31] (03PS8) 10Elukey: profile::hadoop::common: import hiera config from cdh::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/393756 (https://phabricator.wikimedia.org/T167790) [10:16:32] ok on a host where is still failing trying to ping I can see only ARP requests, no reply [10:17:10] <_joe_> volans: I fixed everything I think? [10:17:12] <_joe_> cumin 'P{R:class = role::mediawiki::webserver} and P{F:lldp_parent = asw-c-eqiad}' 'ping -c 10 10.64.32.31' [10:17:36] <_joe_> and then confirmed with cumin 'P{R:class = role::mediawiki::webserver} and P{F:lldp_parent = asw-c-eqiad}' 'arp -a | grep 10.64.32.31' [10:17:39] meh for not using aliases :D [10:17:51] (03PS5) 10Jcrespo: mariadb: Setup s8 empty on eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393808 (https://phabricator.wikimedia.org/T177208) [10:18:35] <_joe_> volans: I have it in my history [10:18:57] <_joe_> volans: the correct thing would've been P{O:mediawiki::webserver} prolly [10:20:02] <_joe_> anyways, I think the problem is solved [10:20:09] not sure why we don't have an alias for all but A:all-mw-eqiad and A:all-mw-codfw should do [10:20:14] <_joe_> we have only one way to be sure though :) [10:20:52] <_joe_> volans: the fact that the aliases file is not readable by non-roots doesnt' exactly help using those [10:20:55] _joe_: so I've seen ARP requests going out and no reply coming, like if the switch didn't sent them to the broader network because it has cached that it's one of it's hosts [10:21:03] so maybe we might still need to clean that cache [10:21:12] <_joe_> volans: well, pinging worked [10:21:12] (03PS1) 10Jcrespo: mariadb: Repool db1110 after arp problems with minimal load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394028 
(https://phabricator.wikimedia.org/T165519) [10:21:18] not at first try [10:21:21] <_joe_> all machines have the good arp [10:21:41] I have prepared a patch with a load of 1 [10:21:44] <_joe_> sure checking the C6 switch can't do harm [10:21:56] yep that's my point ;) [10:22:05] <_joe_> please do :P [10:22:08] of course, I wil lwait [10:22:10] ENOACCESS [10:22:34] <_joe_> me either, so, what do you want me to do? :) [10:22:57] ping ema :D [10:23:16] volans: hey [10:23:34] <_joe_> ok you two talk :) [10:23:35] ema: if you're not too busy, could you check arp cache in C6's switch for 10.64.32.31? [10:23:44] eqiad [10:24:10] if you give us the right answer we've a cookie for you :-P [10:24:38] <_joe_> if it's not the right answer, you'll get a flash supercookie instead [10:25:05] so that would be asw-c-eqiad [10:25:07] * _joe_ goes to hide in shame [10:26:24] ema: good question, given that the issue we've seen seems rack-specific, I'm guessing each physical one has it's own arp cache/ connected hosts [10:26:47] <_joe_> yeah it's the rack-level switch [10:26:50] `show arp` does not list 10.64.32.31 on asw-c-eqiad [10:27:03] <_joe_> other racks in the same row did have the correct arp [10:27:04] show ethernet-switching-table [10:27:09] switches DO not have arp :P [10:27:19] well they do, but for their management which is IP [10:27:24] switching is layer2 [10:27:53] ema@asw-c-eqiad> show ethernet-switching table |match 10.64.32.21 [10:27:55] no output ^ [10:28:04] of course no output... it's layer2 :P [10:28:08] you need a mac address [10:28:13] ah! [10:28:21] but what are you trying to figure out ? [10:28:33] great question! [10:28:36] so you remember yesterdays issue? [10:28:42] <_joe_> if the wrong mac address/ip association is cached there [10:28:43] the ores one ? 
very well [10:28:43] it reappeared today [10:28:44] :P [10:28:52] <_joe_> akosiaris: no the db1110 one [10:28:54] no, the one with mw [10:28:57] <_joe_> the double ip declaration [10:29:13] <_joe_> so turns out no one cleaned the arp caches of hosts in the same rack as mw1329 [10:29:20] <_joe_> so when they repooled db1110 [10:29:33] yeah they should take about 5 mins to clear on their own [10:29:34] <_joe_> those servers went straight to the wrong machine :P [10:29:42] akosiaris: well, it didn't [10:29:43] <_joe_> akosiaris: turns out that's not the case [10:29:53] <_joe_> akosiaris: ftr, I thought the same [10:29:55] the 20000 errors think differently :-) [10:30:19] I haven't followed the db1110 thing yesterday properly, but I can follow this one now [10:30:30] so, mind recapping what it is that you are seeing ? [10:30:43] actually, only like 7000 errors [10:30:52] so all hosts in C6 rack, the same rack as the host that stole db1110 IP yesterday [10:31:08] still had the wrong ARP cache entry today for 10.64.32.31 [10:31:22] so were trying to connect to that wrong host instead of the right DB on another rack [10:31:33] this happened only for hosts in the same rack as the one that stole the IP [10:31:46] and their wrong arp cache entry was (until joe fixed them manually): [10:31:49] 10.64.32.31 ether 18:66:da:99:24:91 C eth0 [10:32:07] db1110 MAC is 80:18:44:df:d2:00 for reference [10:32:31] give me an example hostname [10:32:39] mw1200 [10:32:43] but now it is fixed [10:32:51] er, one that is not [10:33:01] all are, joe fixed them all :D [10:33:20] I have an ARP tcpdump with only requests, no replies [10:33:46] if that can help :) [10:33:49] <_joe_> volans: which is expected given db1110 is not in the same rack? 
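The stale-entry check recapped above can be sketched in Python, assuming `arp`-style output with one entry per line. The function name `find_stale_entries` and the `expected` mapping are hypothetical and exist only for illustration; the sample line and both MAC addresses are the ones quoted in the log. This is a rough sketch, not the procedure actually run on the appservers.

```python
import re

# Loose patterns for an IPv4 address and a colon-separated MAC address.
IP_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")
MAC_RE = re.compile(r"\b([0-9a-f]{2}(?::[0-9a-f]{2}){5})\b", re.IGNORECASE)


def find_stale_entries(arp_output, expected):
    """Return (ip, seen_mac, expected_mac) for every entry whose MAC is wrong.

    `expected` is a hypothetical IP -> MAC mapping maintained by the operator.
    """
    stale = []
    for line in arp_output.splitlines():
        ip_match = IP_RE.search(line)
        mac_match = MAC_RE.search(line)
        if not (ip_match and mac_match):
            continue  # not an ARP entry line
        ip = ip_match.group(1)
        seen = mac_match.group(1).lower()
        want = expected.get(ip)
        if want is not None and seen != want.lower():
            stale.append((ip, seen, want.lower()))
    return stale


# Entry seen on the C6 appservers versus the MAC reported for db1110.
sample = "10.64.32.31 ether 18:66:da:99:24:91 C eth0"
expected = {"10.64.32.31": "80:18:44:df:d2:00"}
print(find_stale_entries(sample, expected))
```

An empty list means every entry that has an expected MAC already matches it.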
[10:34:07] no [10:34:08] <_joe_> no the row would be what matters in this case [10:34:30] <_joe_> same subnet, right [10:34:32] I got a reply once fixed [10:34:34] yep [10:35:14] PROBLEM - eventstreams on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 8092: Connection refused [10:35:23] so akosiaris my wild guess is that the switch on C6, the physical one, had cached that 18:66:da:99:24:91 had 10.64.32.31 and that 10.64.32.31 is db1110.eqiad.wmnet [10:35:40] switches do not have ip => mac mappings [10:35:50] not the otherway around [10:35:55] s/not/nor/ [10:36:05] nothing layer3 on the switch... [10:36:31] so if the hosts had stale arp caches, something was populating them [10:36:44] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v1/dictionary/{word}/{from}/{to}{/provider} (Fetch dictionay meaning with a given provide [10:36:45] a response was received: /v1/dictionary/{word}/{from}/{to}{/provider} (Fetch dictionay meaning without specifying a provider) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) 
timed out before a response was received [10:36:45] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-se [10:36:45] out before a response was received: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received [10:36:51] hmmm [10:36:58] ok, more important stuff [10:37:24] <_joe_> scb1001 has memory issues I guess, lemme see [10:37:34] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received [10:37:44] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [10:37:44] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [10:37:50] this time it could be trending edits again [10:37:54] <_joe_> load average: 214.61, 236.93, 152.55 [10:37:58] ugh [10:38:06] yeah that's probably related to the ores change [10:38:14] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.035 second response time [10:38:25] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy [10:38:53] so OOM did not show up [10:39:02] well it did, but it only killed electron [10:39:09] and that's happening all the time [10:40:10] <_joe_> sigh [10:40:37] cpu usage increase a lot too on scb1001 and scb1002 [10:40:48] (03PS3) 10MarcoAurelio: [WIP] puppet: redirect several wikis per LangCom decission [puppet] - 
10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) [10:41:27] I am guessing service-runner lost heartbeats once more ? [10:41:54] <_joe_> the main cpu hog is celery though [10:42:12] <_joe_> but yes, iotop showed node apps reading from disk a lot from time to time [10:42:22] <_joe_> which is consistent with workers respawning [10:43:34] I am guessing the celery workers can not reconnect to redis [10:43:47] so... dying and being respawned and having to be reinitialized and all ? [10:43:58] cause CPU spikes, starting service-runners [10:44:25] it's also interesting this is happening only on scb1001, scb1002 [10:44:44] and mostly 1001.. probably changeprop is active there and adds to the domino [10:45:18] but overall those 2 boxes seem like they can't handle all that load. I 'll try and lower celery concurrency for those [10:45:36] they are already receiving 33% less http reqs anyway [10:46:32] (03CR) 10Ema: [C: 031] puppet: point codfw misc and canary cp hosts at codfw puppet4 masters [puppet] - 10https://gerrit.wikimedia.org/r/392676 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [10:51:58] (03PS1) 10Alexandros Kosiaris: ORES: lower celery concurrency for scb100{1,2} [puppet] - 10https://gerrit.wikimedia.org/r/394037 (https://phabricator.wikimedia.org/T181538) [10:52:13] this should alleviate some of the problems ^ [10:53:18] (03CR) 10Alexandros Kosiaris: [C: 032] ORES: lower celery concurrency for scb100{1,2} [puppet] - 10https://gerrit.wikimedia.org/r/394037 (https://phabricator.wikimedia.org/T181538) (owner: 10Alexandros Kosiaris) [11:00:34] (03PS1) 10Filippo Giunchedi: role: poll mtail metrics from syslog servers [puppet] - 10https://gerrit.wikimedia.org/r/394038 [11:03:05] (03CR) 10Filippo Giunchedi: Add Prometheus exporter to openldap/labs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/394025 (owner: 10Muehlenhoff) [11:11:18] 10Operations, 10ops-eqiad: Please move db1110 and change its ip - 
https://phabricator.wikimedia.org/T181613#3796029 (10jcrespo) [11:13:32] (03CR) 10Filippo Giunchedi: [C: 032] role: poll mtail metrics from syslog servers [puppet] - 10https://gerrit.wikimedia.org/r/394038 (owner: 10Filippo Giunchedi) [11:14:11] 10Operations, 10ops-eqiad: Please move db1110 and change its ip - https://phabricator.wikimedia.org/T181613#3796056 (10jcrespo) [11:18:23] (03CR) 10Marostegui: "See the comments in line, it looks good only one server missing and converting db1096 to multi-instance" (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393808 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [11:19:12] 10Operations, 10Scoring-platform-team, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3796085 (10akosiaris) [11:19:16] 10Operations, 10Patch-For-Review, 10Scoring-platform-team (Current), 10Wikimedia-Incident: Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200* - https://phabricator.wikimedia.org/T181563#3796083 (10akosiaris) 05Open>03Resolved Re-resolving. This has been deployed. The deploy in codfw... [11:21:54] (03PS1) 10Filippo Giunchedi: role: allow prometheus access to mtail on syslog servers [puppet] - 10https://gerrit.wikimedia.org/r/394040 [11:24:24] PROBLEM - Apache HTTP on mw2206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:24:36] (03PS9) 10Elukey: profile::hadoop::common: import hiera config from cdh::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/393756 (https://phabricator.wikimedia.org/T167790) [11:24:57] volans: ok ores seems to have recovered [11:25:03] back to the arp issues [11:25:14] * volans here [11:25:15] RECOVERY - Apache HTTP on mw2206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.174 second response time [11:25:46] so on cr1-eqiad [11:25:52] 80:18:44:df:d2:00 10.64.32.31 db1110.eqiad.wmnet [11:25:58] which is correct, right ? 
[11:26:05] yes [11:26:14] and same thing on cr2-eqiad [11:26:22] so those arp caches are fine [11:26:47] but given that we got the errors only on one rack, can it be possible that the physical C6 switch has some local cache? [11:26:48] now the question is why hosts ended up having arp caches that showed a different mac [11:27:00] not shown/shared in the virtual row switch [11:27:17] (03PS2) 10Giuseppe Lavagetto: site.pp: remove import of realm.pp [puppet] - 10https://gerrit.wikimedia.org/r/393799 [11:27:19] (03PS1) 10Giuseppe Lavagetto: puppetmaster: default to the future parser [puppet] - 10https://gerrit.wikimedia.org/r/394041 [11:27:21] (03PS1) 10Giuseppe Lavagetto: environment/future: remove redundant settings [puppet] - 10https://gerrit.wikimedia.org/r/394042 [11:27:22] the database held by the switch is in the form "mac => physical port" [11:27:23] (03PS1) 10Giuseppe Lavagetto: profile::base: switch everything back to the default environment [puppet] - 10https://gerrit.wikimedia.org/r/394043 [11:27:26] that's about it [11:27:52] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler03/9041/lithium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/394040 (owner: 10Filippo Giunchedi) [11:27:54] (03CR) 10Filippo Giunchedi: [C: 032] role: allow prometheus access to mtail on syslog servers [puppet] - 10https://gerrit.wikimedia.org/r/394040 (owner: 10Filippo Giunchedi) [11:27:55] and that's all it does [11:28:15] so if the hosts had bad arp caches the only thing that explains it (aside from a kernel bug) [11:28:29] is those hosts receiving arp replies that were wrong [11:28:36] gratuitous or solicited [11:28:55] I didn't see replies when my pings were lost, just ARP requests for db1110 and no replies [11:28:57] the question is more about who/what was sending those arp replies than anything else [11:29:16] <_joe_> so the physical host that has the wrong mac being seen is in initramfs. 
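The distinction drawn above, that a switch holds only "mac => physical port" while hosts and routers hold "ip => mac", can be sketched as a toy Python model. The `Host` and `Switch` classes and the port name `port-a` are hypothetical, introduced only for illustration; the IP and MAC values are the ones quoted in the log. This illustrates the two lookup tables, not how any particular switch is implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Host:
    """A host keeps a layer-3 to layer-2 mapping: IP -> MAC (its ARP cache)."""
    arp_cache: Dict[str, str] = field(default_factory=dict)

    def next_hop_mac(self, ip: str) -> Optional[str]:
        return self.arp_cache.get(ip)


@dataclass
class Switch:
    """A layer-2 switch keeps only MAC -> port; it has no notion of IPs."""
    fdb: Dict[str, str] = field(default_factory=dict)

    def learn(self, mac: str, port: str) -> None:
        self.fdb[mac] = port

    def port_for(self, mac: str) -> Optional[str]:
        return self.fdb.get(mac)


DB1110_IP = "10.64.32.31"
DB1110_MAC = "80:18:44:df:d2:00"
STALE_MAC = "18:66:da:99:24:91"

switch = Switch()
switch.learn(DB1110_MAC, "port-a")  # "port-a" is a made-up port name

stale_host = Host({DB1110_IP: STALE_MAC})   # appserver with the old entry
fixed_host = Host({DB1110_IP: DB1110_MAC})  # appserver after the cleanup

for name, host in (("stale", stale_host), ("fixed", fixed_host)):
    mac = host.next_hop_mac(DB1110_IP)
    print(name, mac, switch.port_for(mac))
```

In this model a wrong IP-to-MAC association can only live in a host's (or router's) ARP cache. Looking up an IP address in the switch table finds nothing, because the table is keyed by MAC.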
[11:29:23] so db1110 was not responding to arp requests [11:29:33] that's what I am getting by this [11:29:35] <_joe_> so it shouldn't have the network up [11:29:57] (03PS6) 10Jcrespo: mariadb: Setup s8 empty on eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393808 (https://phabricator.wikimedia.org/T177208) [11:30:10] !log reboot kafka1001 for kernel + jvm updates - T179943 [11:30:17] (03Abandoned) 10Jcrespo: mariadb: Repool db1110 after arp problems with minimal load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394028 (https://phabricator.wikimedia.org/T165519) (owner: 10Jcrespo) [11:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:18] T179943: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943 [11:31:24] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Setup s8 empty on eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393808 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [11:31:39] that's mw1329 ? [11:32:26] the one that stole the IP yesterday, yes [11:33:49] what about today ? [11:33:53] (03PS7) 10Jcrespo: mariadb: Setup s8 empty on eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393808 (https://phabricator.wikimedia.org/T177208) [11:33:57] what was it that you were seeing before ? [11:34:17] ah 18:66:da:99:24:91 [11:34:32] I was in any of the other mw in C6 and saw the wrong ARP cache entry and pinging db1110 going out ARP requests and no reply on tcpdump [11:34:52] ah yes, it was his MAC [11:34:57] akosiaris@asw-c-eqiad> show ethernet-switching table | match 18:66:da:99:24:91 [11:34:57] that returns nothing [11:35:07] so the switch hasn't seen this mac in any port for quite some time [11:35:19] well... quite some time... 
not sure about fdb timeouts [11:35:22] yesterday when brandon rebooted it [11:37:20] 10Operations, 10ops-eqiad: Please move db1110 and change its ip - https://phabricator.wikimedia.org/T181613#3796119 (10Marostegui) If I can suggest, I would suggest any other rack within the C row if possible, to maintain a decent row distribution across s5 hosts. [11:38:20] so if there were no arp replies in tcpdump, then db1110 was not responding [11:38:24] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1panelId=2fullscreen [11:39:22] cause arp requests are broadcast and they would have been received by it no matter what [11:40:38] yeah, my wild guess was that the requests were not being routed by the switch to the others and then to db1110 in the first place, but my guess might be biased by an issue I had with some switch many years ago ;) [11:41:21] well if the switch was losing broadcast packets we are in way bigger problems [11:41:24] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1panelId=2fullscreen [11:41:28] you mentioned having a tcpdump somewhere ? [11:41:49] just few lines who-has... 
no useful data [11:41:50] (03PS8) 10Jcrespo: mariadb: Setup s8 empty on eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393808 (https://phabricator.wikimedia.org/T177208) [11:42:07] yeah I want to make sure about the source/destination mac [11:42:28] sorry, forgot to add the extended options :( [11:42:35] mw exceptions seem to be ores related [11:42:41] yeah ores is suffering [11:43:00] well flapping [11:43:01] https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&from=now-30m&to=now-1m [11:43:04] I do not think you have time for the arp issue now [11:43:19] I have asked for the nuclear option [11:43:31] I also do not have time to have that server depooled for long [11:43:47] that doesn't mean investigation should not happen [11:43:58] but probably at a better time [11:44:27] jynus: I think it can be repooled, all the mw hosts have the right arp cache and can ping, I can redo joe's test if you want right now [11:44:54] (03CR) 10Marostegui: mariadb: Setup s8 empty on eqiad (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393808 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [11:45:09] so even if there was some other cache it seems that it's updated at this point [11:45:27] I don't care, I do not want to use that ip anymore [11:45:39] once it failed twice [11:45:43] lol [11:46:07] we can pool a dummy server and do further testing [11:46:18] but not with the largest wikidata database server [11:48:21] (03PS9) 10Jcrespo: mariadb: Setup s8 empty on eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393808 (https://phabricator.wikimedia.org/T177208) [11:49:14] sorry to not do things properly, but one has to be practical, given the limited resources and the many outages [11:49:15] (03PS10) 10Elukey: profile::hadoop::common: import hiera config from cdh::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/393756 (https://phabricator.wikimedia.org/T167790) [11:54:27] (03CR) 10Marostegui: [C: 031] mariadb: Setup s8 
empty on eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393808 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [12:00:57] ç [12:01:24] !log Deploy schema change on db1072 (sanitarium master) on s3 with replication enabled to replicate to labs - T174569 [12:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:32] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [12:07:10] (03PS1) 10Elukey: [WIP] profile::hadoop::common: add Prometheus JMX exporter configuration [puppet] - 10https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458) [12:09:56] (03PS4) 10Muehlenhoff: Add Prometheus exporter to openldap/labs [puppet] - 10https://gerrit.wikimedia.org/r/394025 [12:10:20] (03CR) 10jerkins-bot: [V: 04-1] Add Prometheus exporter to openldap/labs [puppet] - 10https://gerrit.wikimedia.org/r/394025 (owner: 10Muehlenhoff) [12:18:40] (03CR) 10Jcrespo: [C: 032] mariadb: Setup s8 empty on eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393808 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [12:20:04] (03Merged) 10jenkins-bot: mariadb: Setup s8 empty on eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393808 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [12:20:11] (03CR) 10jenkins-bot: mariadb: Setup s8 empty on eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393808 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [12:20:20] (03PS5) 10Muehlenhoff: Add Prometheus exporter to openldap/labs [puppet] - 10https://gerrit.wikimedia.org/r/394025 [12:22:02] jynus: let me know when deployed to mwdebug1001 and I will test dewiki there too [12:22:34] I am on it :-) [12:22:37] it is on tin already [12:22:38] :) [12:23:05] !log jynus@tin Synchronized wmf-config/db-codfw.php: Pool db1096:3315 (duration: 00m 48s) [12:23:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:13] ^that is the codfw change [12:23:35] (03PS1) 10Alexandros Kosiaris: Increase ORES queue_maxsize by 20% [puppet] - 10https://gerrit.wikimedia.org/r/394047 (https://phabricator.wikimedia.org/T181538) [12:23:53] ok [12:24:05] !log scap pull on mwdebug1001 [12:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:21] (03PS6) 10Muehlenhoff: Add Prometheus exporter to openldap/labs [puppet] - 10https://gerrit.wikimedia.org/r/394025 [12:24:28] not sure it has finished- it says rsync finished [12:24:44] but the process didn't finish? [12:24:59] (03CR) 10Filippo Giunchedi: Add Prometheus exporter to openldap/labs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/394025 (owner: 10Muehlenhoff) [12:25:15] the file is the new one [12:25:18] on mwdebug1001 [12:25:22] so it is deployed there [12:25:26] db-eqiad.php [12:25:36] ok, now it finished [12:25:37] so, ORES is mildly flapping (entering into an "overload" state). It recovers after a while. I got https://gerrit.wikimedia.org/r/#/c/394047/ up for review and expecting +1/-1s. I am going for lunch, I don't expect service issues during the next hour but page me if needed [12:26:25] wikidata looks fine, I am browsing random pages in mwdebug1001 [12:27:16] same here [12:27:24] I can start a curl in a loop [12:27:31] to try to get errors [12:27:32] going to try dewiki [12:28:50] dbconnection is mostly empty [12:28:56] which is the one I would expect to fail [12:28:57] dewiki also looks fine [12:29:00] that and DBQuery [12:29:02] no issues browsing it randomly [12:29:55] (03PS7) 10Muehlenhoff: Add Prometheus exporter to openldap/labs [puppet] - 10https://gerrit.wikimedia.org/r/394025 [12:30:08] 10Operations, 10Scoring-platform-team, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3796190 (10awight) Looks like we'll be doing the same thing today. 
There have been intermittent overload incidents for the last two hours, during... [12:30:18] let me try to overload dewiki [12:30:51] ok! [12:31:16] failed to connect to db1105:3311 [12:31:19] but only 1 [12:31:22] which is normal [12:31:26] and that is enwiki [12:31:40] yeah, I was going to say that host isn't involved in our change [12:31:40] maybe something to tune later, but not related [12:32:01] I see no connection errors [12:32:06] that host is up and running fine, so not an issue [12:32:13] so I will deploy globally [12:32:16] +1 [12:32:36] I see exceptions, but all from ores [12:33:21] (03CR) 10Awight: "Good idea for a temporary workaround. In the long run, we obviously need to get onto the new machines with more memory." [puppet] - 10https://gerrit.wikimedia.org/r/394037 (https://phabricator.wikimedia.org/T181538) (owner: 10Alexandros Kosiaris) [12:33:39] deploying now s8 [12:33:47] \o/ [12:34:22] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Setup s8 on eqiad, with no wikis (duration: 00m 48s) [12:34:28] there we go [12:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:31] no error rate issues [12:35:04] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1panelId=2fullscreen [12:35:11] mmm [12:35:13] ores? [12:35:18] hopefully [12:35:23] can't see anything on logstash [12:35:26] related to our change yet [12:35:29] yes, I see ores on logstash [12:35:33] 10Operations, 10Scoring-platform-team, 10monitoring: Investigate scb1001 and scb1002 available memory graphs in Grafana - https://phabricator.wikimedia.org/T181544#3796193 (10awight) @akosiaris And thoughts about how we would troubleshoot this metrics problem? I'd like to review the modules responsible for... 
[12:35:42] 10Operations, 10monitoring, 10Scoring-platform-team (Current): Investigate scb1001 and scb1002 available memory graphs in Grafana - https://phabricator.wikimedia.org/T181544#3796195 (10awight) [12:35:47] yes, it looks only ores so far [12:36:22] akosiaris: for context- we are not complaining, just making sure we have not added more issues [12:36:43] also almost no text errors [12:36:49] upload and cxserver [12:40:26] <_joe_> so we're having an ores outage? [12:40:53] _joe_: yep. Here’s the tracking task, if you want to play at home: T181538 [12:40:54] T181538: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538 [12:41:13] <_joe_> is someone working on it? [12:41:16] _joe_: some flapping, we were focusing on discarding not new errors from our latest deployment [12:41:21] <_joe_> just to know if I'm needed or not [12:41:35] I do not think it is an outage [12:41:38] not yet [12:41:40] _joe_: I just regained consciousness, I’ll be on it all day. [12:41:49] just some bad requests for now [12:41:53] <_joe_> ok [12:42:03] <_joe_> I'm afk for a while then [12:42:06] We’re hoping that this no longer has any visible effect on user-facing pages. [12:42:07] but based on alex [12:42:13] 's comments [12:42:18] it may be like that for a while [12:42:51] jynus: +1 I can confirm, we worked on this for all of yesterday and it feels no closer to resolution. [12:43:09] we are talking 1 error per minute [12:43:16] so not pathological [12:43:25] we were just thorough on our unrelated deploy [12:43:39] +1 the train is innocent! [12:43:43] probably less [12:44:05] 2 errors per minute with ores + upload + wikidata query + phabricator [12:44:28] we just had to check it was not us [12:44:46] Hmm that sounds suspiciously low, actually. I’d expect an error for every edit to an ORES-enabled wiki, during the overload windows. [12:45:45] oh, I had my units bad [12:45:51] more like 10-20 per minute [12:46:20] OK. 
[12:49:28] Pchelolo: graphoid is emitting a lot of these, just noticed cos I’m looking at the scb machine logs: > Unrecognized interval: undefined [12:49:46] ERROR severity. [12:50:49] awight: I have very limited idea on how graphoid works, but I think this is due to some incorrect input from the clients [12:51:04] RECOVERY - Restbase root url on restbase1012 is OK: HTTP OK: HTTP/1.1 200 - 15785 bytes in 0.011 second response time [12:51:10] Oops, sorry I randomly grabbed you from the github contributors :) [12:51:54] mobrovac: note about some graphoid logspam above ^ [12:52:43] awight: I think the best person would be MaxSem [12:53:03] (but don't tell him it was me who sent you to him) [12:53:10] lol! [12:53:14] awight: where do you see the spam? [12:53:27] mobrovac: search `graphoid` in logstash [12:53:44] mobrovac: ^ that. type:graphoid, specifically I was looking at scb1001 [12:54:50] seems like the logs started on 2017-10-26:15:00 [12:55:02] awight: these all seem to be client errors [12:55:30] less than 1 per second [12:56:15] akosiaris: morning :-/. I see that most of the timeout errors have been coming from one machine, scb1003. “overload” errors seem evenly distributed, though. I can look at why that would be, but CPU usage makes it look like something else is happening on 1003. [12:56:44] mobrovac: ok, good to hear! Thanks for checking it out. [12:57:33] akosiaris: available memory also spikes during high-CPU periods…. OOM killer plus something else? [12:57:40] Where do I find OOM killer logs? [12:58:16] things look pretty quiet now on scb1003 [12:58:18] load average: 7.85, 10.31, 10.67 [12:59:06] mobrovac: +1 it’s only a hair above baseline CPU in this graph, https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=1511949530450&to=1511960270450 [12:59:32] but you can see the huge CPU spikes during our problem windows. I can’t explain those yet, it’s not supposed to be an ORES behavior. 
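[Annotation: on the "Where do I find OOM killer logs?" question above. The OOM killer reports through the kernel log, so its messages normally show up in `dmesg` and in whatever file syslog routes kernel messages to. The sketch below is one possible way to pull the kills out of such a log. The exact message wording differs between kernel versions, so the pattern is an assumption; the function name `find_oom_kills` is hypothetical and the sample lines are invented for illustration, not taken from scb1003.]

```python
import re

# Assumed wording, as printed by many Linux kernels:
#   "Out of memory: Kill process 13231 (celery) score 52 or sacrifice child"
# Newer kernels print "Killed process" instead, so both forms are accepted.
OOM_RE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(lines):
    """Return a list of (pid, process_name) for every OOM kill in `lines`."""
    kills = []
    for line in lines:
        match = OOM_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


# Invented sample lines in syslog style.
SAMPLE_LOG = [
    "Nov 29 10:32:58 host kernel: [1.0] celery invoked oom-killer: gfp_mask=0x0",
    "Nov 29 10:32:58 host kernel: [1.1] Out of memory: Kill process 13231 (celery) score 52 or sacrifice child",
    "Nov 29 10:40:02 host kernel: [2.0] Out of memory: Killed process 14002 (uwsgi) total-vm:100kB",
    "Nov 29 10:41:00 host sshd[99]: Accepted publickey for someone",
]

kills = find_oom_kills(SAMPLE_LOG)
print(kills)
```

[In practice the input would be the lines of the kernel log file or of `dmesg` output rather than a hard-coded list.]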
[13:00:44] awight: from experience, ores is prone to have such episodes, the celery workers have a high chance of using 100% of cpu whenever i look at top on scb nodes [13:00:55] akosiaris: FYI I’m also working on another defensive thing, to get our code running on the new cluster. Currently blocked by T181552 [13:00:55] T181552: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552 [13:00:59] usually with each of them using at least 3%, 4% of mem [13:01:23] 10Operations, 10Goal, 10User-fgiunchedi, 10cloud-services-team (Kanban): Package PDNS Recursor collector for Prometheus and adapt metrics - https://phabricator.wikimedia.org/T181620#3796244 (10MoritzMuehlenhoff) [13:01:27] mobrovac: thanks, that’s very helpful info. I’ll note that in our bug. [13:03:52] 10Operations, 10Scoring-platform-team, 10Wikimedia-Incident: What is causing ORES celery workers to suddenly require more CPU? - https://phabricator.wikimedia.org/T181621#3796259 (10awight) [13:09:24] PROBLEM - HHVM rendering on mw2216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:10:15] RECOVERY - HHVM rendering on mw2216 is OK: HTTP OK: HTTP/1.1 200 OK - 74868 bytes in 0.324 second response time [13:12:03] (03PS2) 10Addshore: Enable AdvancedSearch on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390387 (https://phabricator.wikimedia.org/T180128) [13:12:08] (03PS3) 10Addshore: Enable AdvancedSearch on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390387 (https://phabricator.wikimedia.org/T180128) [13:12:12] (03CR) 10jerkins-bot: [V: 04-1] Enable AdvancedSearch on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390387 (https://phabricator.wikimedia.org/T180128) (owner: 10Addshore) [13:12:35] 10Operations, 10Scoring-platform-team, 10Wikimedia-Incident: What is causing ORES celery workers to suddenly require more CPU? 
- https://phabricator.wikimedia.org/T181621#3796299 (10awight) @akosiaris I'd like to get our celery logs routed to logstash, at INFO level. We could just pipe into a file too, per... [13:15:07] (03PS4) 10Addshore: Enable AdvancedSearch in Arabic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393369 (https://phabricator.wikimedia.org/T180291) (owner: 10Jayprakash12345) [13:15:26] (03PS6) 10Arturo Borrero Gonzalez: apt: add --force-confold dpkg option to apt calls [puppet] - 10https://gerrit.wikimedia.org/r/392421 (https://phabricator.wikimedia.org/T180811) [13:17:45] !log awight@tin Started deploy [ores/deploy@e58bfbf]: (non-production) Update ORES on new cluster (take 3) [13:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:13] !log reboot kafka100[23] for jvm+kernel updates - T179943 [13:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:21] T179943: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943 [13:18:37] (03PS2) 10Elukey: [WIP] profile::hadoop::common: add Prometheus JMX exporter configuration [puppet] - 10https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458) [13:18:54] (03CR) 10jerkins-bot: [V: 04-1] [WIP] profile::hadoop::common: add Prometheus JMX exporter configuration [puppet] - 10https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [13:21:14] (03PS3) 10Elukey: [WIP] profile::hadoop::common: add Prometheus JMX exporter configuration [puppet] - 10https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458) [13:21:20] (03CR) 10Faidon Liambotis: apt: add --force-confold dpkg option to apt calls (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/392421 (https://phabricator.wikimedia.org/T180811) (owner: 10Arturo Borrero Gonzalez) [13:23:28] 10Operations, 10ORES, 10Release-Engineering-Team, 
10Scoring-platform-team (Current), and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3796325 (10awight) @thcipriani @mmodell Is the fix for T179013 deployed to production? I'm hoping the fix will be that s... [13:29:08] 10Operations, 10Goal, 10User-fgiunchedi, 10cloud-services-team (Kanban): Package PDNS Recursor collector for Prometheus and adapt metrics - https://phabricator.wikimedia.org/T181620#3796328 (10MoritzMuehlenhoff) [13:29:22] paravoid: ACK, will check in deep [13:33:30] (03PS4) 10Elukey: [WIP] profile::hadoop::common: add Prometheus JMX exporter configuration [puppet] - 10https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458) [13:34:33] !log awight@tin Finished deploy [ores/deploy@e58bfbf]: (non-production) Update ORES on new cluster (take 3) (duration: 16m 49s) [13:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:34] PROBLEM - Restbase root url on restbase1012 is CRITICAL: connect to address 10.64.32.79 and port 7231: Connection refused [13:36:30] that's ok ^ [13:38:59] (03PS1) 10Filippo Giunchedi: cassandra: reprovision restbase1012 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/394053 (https://phabricator.wikimedia.org/T179422) [13:41:07] (03CR) 10Filippo Giunchedi: [C: 04-1] "DNM yet" [puppet] - 10https://gerrit.wikimedia.org/r/394053 (https://phabricator.wikimedia.org/T179422) (owner: 10Filippo Giunchedi) [13:42:22] (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/apertium-crh-tur] - 10https://gerrit.wikimedia.org/r/393993 (https://phabricator.wikimedia.org/T181465) (owner: 10KartikMistry) [13:42:33] (03CR) 10jerkins-bot: [V: 04-1] apertium-crh-tur: New upstream release [debs/contenttranslation/apertium-crh-tur] - 10https://gerrit.wikimedia.org/r/393993 (https://phabricator.wikimedia.org/T181465) (owner: 10KartikMistry) [13:49:57] (03PS5) 10Elukey: [WIP] profile::hadoop::common: add Prometheus JMX exporter 
configuration [puppet] - 10https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458) [13:55:36] (03PS2) 10KartikMistry: apertium-crh-tur: New upstream release [debs/contenttranslation/apertium-crh-tur] - 10https://gerrit.wikimedia.org/r/393993 (https://phabricator.wikimedia.org/T181465) [13:55:50] halfak: akosiaris: Nov 29 10:32:59 scb1003 celery-ores-worker[13231]: MemoryError: [Errno 12] Cannot allocate memory [13:56:02] (03CR) 10jerkins-bot: [V: 04-1] apertium-crh-tur: New upstream release [debs/contenttranslation/apertium-crh-tur] - 10https://gerrit.wikimedia.org/r/393993 (https://phabricator.wikimedia.org/T181465) (owner: 10KartikMistry) [13:56:32] Can anyone tell me how to read OOM killer logs? [13:56:44] awight: speaking of logs, ores' app.log contains loads of exceptions, but they are not time-stamped, it would be useful if they were [13:57:40] mobrovac: +1, thanks for the suggestion. I’ll fix the formatter. I actually don’t read the logfiles cos I don’t have permissions to see main.log. I just read in logstash. [13:58:52] awight: app.log is readable by world, so you can read that one [13:58:59] yes [13:59:12] awight: and it's also pretty silly that main.log is not readable by world, that ought to be fixed imho [13:59:28] I see something about oom_kill going to syslog/kernel, where do I find that… [14:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171129T1400). [14:00:05] Urbanecm and Addshore: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
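[Annotation: on the 13:56 to 13:57 exchange about app.log entries not being time-stamped. A minimal sketch of a timestamping formatter using Python's standard `logging` module is shown below. It is not the actual ORES logging configuration; the logger name `ores.example` and the helper `make_timestamped_logger` are hypothetical.]

```python
import io
import logging


def make_timestamped_logger(name, stream):
    """Return a logger whose records are prefixed with the time they were emitted."""
    handler = logging.StreamHandler(stream)
    # %(asctime)s renders as e.g. "2017-11-29 13:56:44,123" with the default date format.
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep the record out of the root logger's handlers
    return logger


# Write to an in-memory buffer here; a real service would pass a file stream.
buf = io.StringIO()
log = make_timestamped_logger("ores.example", buf)
log.error("Cannot allocate memory")
print(buf.getvalue(), end="")
```

[With a format like this, exceptions in the log can be lined up against the CPU and memory graphs discussed earlier.]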
[14:00:30] I can SWAT today [14:00:54] Urbanecm, addshore: do you want to deploy your changes, or should I? [14:01:43] zeljkof: i can do mine! [14:01:56] addshore: want to go first? [14:02:01] can do! [14:02:08] go ahead [14:02:19] (03CR) 10Addshore: [C: 032] Enable AdvancedSearch on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390387 (https://phabricator.wikimedia.org/T180128) (owner: 10Addshore) [14:02:22] doing [14:02:44] PROBLEM - Restbase root url on restbase1014 is CRITICAL: connect to address 10.64.48.133 and port 7231: Connection refused [14:03:33] uh oh, wow much ores errors in logs, so scary [14:03:37] zeljkof: indeed [14:03:53] (03Merged) 10jenkins-bot: Enable AdvancedSearch on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390387 (https://phabricator.wikimedia.org/T180128) (owner: 10Addshore) [14:04:47] is ores outage happening? [14:05:16] zeljkof: yep. If you’re curious, T181538 [14:05:16] T181538: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538 [14:05:31] awight: thanks [14:06:07] 10Operations, 10Scoring-platform-team, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3796413 (10awight) Most of our worker boxes are down. Here's the app.log from one that's down, last written to 2 hours ago: ``` Connection to Red... [14:06:15] (03PS2) 10Filippo Giunchedi: cassandra: reprovision restbase1012 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/394053 (https://phabricator.wikimedia.org/T179422) [14:06:22] Urbanecm: around for EU SWAT? 
[14:06:44] AFAICT, we're receiving an extraordinary request load, we're struggling with some connections to redis, we're reaching OOM on scb nodes and MW is sometimes internally DOSing us with requests :) [14:06:54] Everything seems to be going wrong at once :D [14:06:58] (03CR) 10Filippo Giunchedi: [C: 032] cassandra: reprovision restbase1012 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/394053 (https://phabricator.wikimedia.org/T179422) (owner: 10Filippo Giunchedi) [14:07:01] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT T180128 [[gerrit:390387|AdvancedSearch for dewiki]] (duration: 00m 50s) [14:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:08] T180128: Make AdvancedSearch a beta feature on de-wiki and ar-wiki - https://phabricator.wikimedia.org/T180128 [14:07:22] (03CR) 10Addshore: [C: 032] Enable AdvancedSearch in Arabic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393369 (https://phabricator.wikimedia.org/T180291) (owner: 10Jayprakash12345) [14:07:45] (03CR) 10jenkins-bot: Enable AdvancedSearch on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390387 (https://phabricator.wikimedia.org/T180128) (owner: 10Addshore) [14:08:40] (03Merged) 10jenkins-bot: Enable AdvancedSearch in Arabic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393369 (https://phabricator.wikimedia.org/T180291) (owner: 10Jayprakash12345) [14:09:53] (03PS6) 10Elukey: [WIP] profile::hadoop::common: add Prometheus JMX exporter configuration [puppet] - 10https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458) [14:10:52] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT T180291 T180128 [[gerrit:393369|AdvancedSearch for arwiki]] (duration: 00m 49s) [14:10:59] (03CR) 10jenkins-bot: Enable AdvancedSearch in Arabic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393369 
(https://phabricator.wikimedia.org/T180291) (owner: 10Jayprakash12345) [14:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:00] T180291: beta test AdvancedSearch in Arabic Wikipedia - https://phabricator.wikimedia.org/T180291 [14:11:22] zeljkof: thats me all done [14:11:31] addshore: great! [14:11:43] looks like Urbanecm is not around... [14:12:47] !log reimage restbase1012 - T179422 [14:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:54] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [14:12:57] zeljkof: https://gerrit.wikimedia.org/r/#/c/393722/1/wmf-config/throttle.php you can deploy it [14:13:16] hasharAway: yes, throttle rule looks simple, nothing to test [14:13:19] (03CR) 10Hashar: [C: 031] Define throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393722 (https://phabricator.wikimedia.org/T181367) (owner: 10Urbanecm) [14:13:30] !log rebooting neodymium for kernel update to 4.9.51 [14:13:33] https://gerrit.wikimedia.org/r/#/c/393835/1/wmf-config/InitialiseSettings.php usually it is easy [14:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:49] maybe we gotta run the namespaceDupes.php script before and check whether the wiki namespaces are all fine [14:13:52] (03PS7) 10Elukey: [WIP] profile::hadoop::common: add Prometheus JMX exporter configuration [puppet] - 10https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458) [14:14:02] then I usually scap pull on terbium [14:14:07] and run namespaceDupes.php there again [14:14:18] !log beginning cut over of cache::misc and codfw cache::canary cp servers to codfw puppet4 masters [14:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:31] herron: \o/ [14:15:37] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393722 
(https://phabricator.wikimedia.org/T181367) (owner: 10Urbanecm) [14:15:44] (03PS8) 10Elukey: [WIP] profile::hadoop::common: add Prometheus JMX exporter configuration [puppet] - 10https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458) [14:16:57] (03Merged) 10jenkins-bot: Define throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393722 (https://phabricator.wikimedia.org/T181367) (owner: 10Urbanecm) [14:17:47] (03PS1) 10Giuseppe Lavagetto: git::install: fix conditional for puppet 4 [puppet] - 10https://gerrit.wikimedia.org/r/394056 [14:18:36] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:393722|Define throttle rule (T181367)]] (duration: 00m 49s) [14:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:43] T181367: Account creation throttle exemption for workshop on 2017-12-01 - https://phabricator.wikimedia.org/T181367 [14:18:58] (03CR) 10Giuseppe Lavagetto: [C: 032] git::install: fix conditional for puppet 4 [puppet] - 10https://gerrit.wikimedia.org/r/394056 (owner: 10Giuseppe Lavagetto) [14:19:03] 10Operations, 10monitoring, 10Scoring-platform-team (Current): Investigate scb1001 and scb1002 available memory graphs in Grafana - https://phabricator.wikimedia.org/T181544#3796458 (10awight) Here's a graph that shows the OOM ceiling correctly, for comparison: https://grafana.wikimedia.org/dashboard/file/se... 
[14:19:12] 10Operations, 10Ops-Access-Requests: Requesting access to terbium.eqiad.wmnet for cparle - https://phabricator.wikimedia.org/T181626#3796459 (10Cparle) [14:19:40] hasharAway: I would rather not deploy 393835 [14:19:49] I don't know how to test it [14:19:54] (03PS1) 10Marostegui: s8.hosts: Create file with the s8 hosts [software] - 10https://gerrit.wikimedia.org/r/394057 (https://phabricator.wikimedia.org/T177208) [14:20:14] we can deploy if Urbanecm comes during EU SWAT [14:21:20] (03PS9) 10Elukey: [WIP] profile::hadoop::common: add Prometheus JMX exporter configuration [puppet] - 10https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458) [14:21:56] 10Operations, 10Goal, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3796474 (10fgiunchedi) [14:22:12] (03PS5) 10Herron: puppet: point codfw misc and canary cp hosts at codfw puppet4 masters [puppet] - 10https://gerrit.wikimedia.org/r/392676 (https://phabricator.wikimedia.org/T177254) [14:22:32] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3650139 (10fgiunchedi) [14:22:48] (03CR) 10Herron: [C: 032] puppet: point codfw misc and canary cp hosts at codfw puppet4 masters [puppet] - 10https://gerrit.wikimedia.org/r/392676 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [14:22:53] (03PS2) 10Marostegui: s8.hosts: Add eqiad hosts [software] - 10https://gerrit.wikimedia.org/r/394057 (https://phabricator.wikimedia.org/T177208) [14:24:28] (03CR) 10Elukey: "@Ottomata: this seems to do what we need (https://puppet-compiler.wmflabs.org/compiler02/9056/analytics1040.eqiad.wmnet/) and it should be" [puppet] - 10https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [14:25:21] !log EU SWAT finished [14:25:27] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:27] 10Operations, 10Scoring-platform-team, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3796496 (10awight) oresrdb1001 is dying: ``` [09:22am] akosiaris: [761 | signal handler] (1511951211) Received SIGTERM scheduling shutdown... [09:2... [14:25:58] (03CR) 10Zfilipin: "Scheduled for EU SWAT, but Urbanecm was not around so it was not deployed. Please reschedule." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393835 (https://phabricator.wikimedia.org/T181374) (owner: 10Urbanecm) [14:27:03] (03PS1) 10Jon Harald Søby: Add "Prosjekt" namespace to nowikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394059 (https://phabricator.wikimedia.org/T181625) [14:28:45] (03CR) 10Ottomata: profile::hadoop::common: import hiera config from cdh::hadoop (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393756 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [14:29:50] ottomata: --verbose? :) [14:30:36] (03CR) 10Jcrespo: [C: 031] s8.hosts: Add eqiad hosts [software] - 10https://gerrit.wikimedia.org/r/394057 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [14:30:40] ah you meant that I should put in the profile the defaults from cdh [14:30:46] (03CR) 10Ottomata: [WIP] profile::hadoop::common: add Prometheus JMX exporter configuration (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [14:30:58] elukey: ya! that way labs will work without (much) overriding [14:31:08] 10Operations, 10Ops-Access-Requests: Requesting access to terbium.eqiad.wmnet for cparle - https://phabricator.wikimedia.org/T181626#3796513 (10Gilles) With @Cparle working on the backend of Multimedia files, the necessity to run maintenance script is kind of inevitable for that sort of task. His work in that... 
[14:31:33] (CR) Jcrespo: [C: 1] "This is correct, although we should followup" [puppet] - https://gerrit.wikimedia.org/r/393725 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[14:32:04] (PS7) Arturo Borrero Gonzalez: apt: add --force-confold dpkg option to apt calls [puppet] - https://gerrit.wikimedia.org/r/392421 (https://phabricator.wikimedia.org/T180811)
[14:32:31] (CR) Jcrespo: [C: 1] "I am going to try to deploy this today and not break anything in the process." [puppet] - https://gerrit.wikimedia.org/r/375347 (https://phabricator.wikimedia.org/T113842) (owner: Tpt)
[14:32:33] (CR) Elukey: [WIP] profile::hadoop::common: add Prometheus JMX exporter configuration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458) (owner: Elukey)
[14:32:38] (PS2) Giuseppe Lavagetto: puppetmaster: default to the future parser [puppet] - https://gerrit.wikimedia.org/r/394041
[14:32:40] (PS3) Giuseppe Lavagetto: site.pp: remove import of realm.pp [puppet] - https://gerrit.wikimedia.org/r/393799
[14:32:42] (PS2) Giuseppe Lavagetto: environment/future: remove redundant settings [puppet] - https://gerrit.wikimedia.org/r/394042
[14:32:44] (PS2) Giuseppe Lavagetto: profile::base: switch everything back to the default environment [puppet] - https://gerrit.wikimedia.org/r/394043
[14:32:49] ottomata: all right got it, thanks :)
[14:34:15] Operations, Scoring-platform-team, monitoring, Wikimedia-Incident: Send celery and wsgi service logs to logstash - https://phabricator.wikimedia.org/T181630#3796535 (awight)
[14:34:40] (CR) Giuseppe Lavagetto: [C: 2] puppetmaster: default to the future parser [puppet] - https://gerrit.wikimedia.org/r/394041 (owner: Giuseppe Lavagetto)
[14:34:49] (CR) jenkins-bot: Define throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/393722 (https://phabricator.wikimedia.org/T181367) (owner: Urbanecm)
[14:35:42] !log awight@tin Started deploy [ores/deploy@e58bfbf]: Restart ORES services, T181538
[14:35:43] (PS1) Ema: prometheus: add varnish-canary job definition [puppet] - https://gerrit.wikimedia.org/r/394063
[14:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:49] T181538: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538
[14:35:58] !log awight@tin Finished deploy [ores/deploy@e58bfbf]: Restart ORES services, T181538 (duration: 00m 16s)
[14:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:25] !log reboot druid100[456] for jvm+kernel updates - T179943
[14:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:32] T179943: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943
[14:37:19] (PS8) Arturo Borrero Gonzalez: apt: add --force-confold/--force-confdef dpkg option to apt calls [puppet] - https://gerrit.wikimedia.org/r/392421 (https://phabricator.wikimedia.org/T180811)
[14:37:27] !log awight@tin Started restart [ores/deploy@e58bfbf]: Restart ORES services (take 2), T181538
[14:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:39] <_joe_> !log reloading apache on puppetmasters to pick up the configuration changes
[14:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:50] herron: troubles with v4 puppetmaster and cp1008?
[14:40:22] <_joe_> ema:it could be related to my reloads
[14:41:38] (PS1) Herron: puppet: update hiera function calls in varnish normalize_path [puppet] - https://gerrit.wikimedia.org/r/394065 (https://phabricator.wikimedia.org/T179181)
[14:42:05] <_joe_> oh, my
[14:42:12] <_joe_> seriously why hiera calls there?
[14:42:40] ema yeah ran into function_hiera calls in varnish normalize_path that need updating
[14:42:47] <_joe_> herron: -1 to your change
[14:43:37] (CR) Giuseppe Lavagetto: [C: -1] "Since hiera calls in templates are evil, let's use the global variable $cluster here instead." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/394065 (https://phabricator.wikimedia.org/T179181) (owner: Herron)
[14:44:29] PROBLEM - puppet last run on mw2178 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apache2/sites-available/06-secure-wikimedia.conf]
[14:46:00] PROBLEM - puppet last run on prometheus2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:46:48] (PS2) Herron: puppet: remove hiera calls from varnish normalize_path [puppet] - https://gerrit.wikimedia.org/r/394065 (https://phabricator.wikimedia.org/T179181)
[14:47:20] (CR) Ottomata: [WIP] profile::hadoop::common: add Prometheus JMX exporter configuration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458) (owner: Elukey)
[14:49:03] _joe_ better?
[14:49:29] RECOVERY - puppet last run on mw2178 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[14:51:31] <_joe_> herron: try it in the puppet compiler first, and ask ema/bblack for their blessing, but yes
[14:52:10] PROBLEM - Disk space on labtestcontrol2001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=65%)
[14:52:30] <_joe_> arturo:
[14:52:31] <_joe_> ^^
[14:52:44] <_joe_> labtestcontrol has a full disk, fyi
[14:54:20] PROBLEM - HHVM rendering on mw2205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:54:40] (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler02/9059/" [puppet] - https://gerrit.wikimedia.org/r/394065 (https://phabricator.wikimedia.org/T179181) (owner: Herron)
[14:55:10] RECOVERY - HHVM rendering on mw2205 is OK: HTTP OK: HTTP/1.1 200 OK - 74866 bytes in 0.292 second response time
[14:55:58] (CR) Giuseppe Lavagetto: [C: 1] puppet: remove hiera calls from varnish normalize_path [puppet] - https://gerrit.wikimedia.org/r/394065 (https://phabricator.wikimedia.org/T179181) (owner: Herron)
[14:58:17] Operations, Scoring-platform-team, Wikimedia-Incident: Celery manager implodes horribly if Redis goes down - https://phabricator.wikimedia.org/T181632#3796581 (awight)
[14:58:45] (PS11) Elukey: profile::hadoop::common: import hiera config from cdh::hadoop [puppet] - https://gerrit.wikimedia.org/r/393756 (https://phabricator.wikimedia.org/T167790)
[14:58:50] Operations, Scoring-platform-team, Wikimedia-Incident: Celery manager implodes horribly if Redis goes down - https://phabricator.wikimedia.org/T181632#3796595 (awight)
[14:59:00] OK _joe_ will take a look after lunch, CC chasemp
[14:59:39] thanks, I'm on it _joe_ arturo (not sure why atm)
[15:00:30] !log Restarting ORES celery workers manually, T181538
[15:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:38] T181538: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538
[15:00:49] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1panelId=2fullscreen
[15:01:19] RECOVERY - Disk space on labtestcontrol2001 is OK: DISK OK
[15:02:15] !log purge old qcow2 images from under /home on labtestcontrol2001
[15:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:49] PROBLEM - puppet last run on puppetmaster2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:05:52] (PS12) Elukey: profile::hadoop::common: import hiera config from cdh::hadoop [puppet] - https://gerrit.wikimedia.org/r/393756 (https://phabricator.wikimedia.org/T167790)
[15:08:28] mhhh I just reimaged restbase1012 and the first puppet run fails with this
[15:08:35] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Evaluation Error: Use of 'import' has been discontinued in favor of a manifest directory. See http://links.puppetlabs.com/puppet-import-deprecation at /etc/puppet/manifests/site.pp:4:5 on node restbase1012.eqiad.wmnet
[15:08:53] I suppose related to the latest change puppet 3/4 change, _joe_ herron
[15:10:29] PROBLEM - puppet last run on puppetmaster2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:10:46] (PS13) Elukey: profile::hadoop::common: import hiera config from cdh::hadoop [puppet] - https://gerrit.wikimedia.org/r/393756 (https://phabricator.wikimedia.org/T167790)
[15:11:00] RECOVERY - puppet last run on prometheus2004 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[15:11:10] godog hmm looking
[15:11:56] Operations, Scoring-platform-team, Wikimedia-Incident: Investigate overload condition, seems that we lose nodes - https://phabricator.wikimedia.org/T181634#3796634 (awight)
[15:12:38] Operations, Scoring-platform-team, Wikimedia-Incident: What is causing ORES celery workers to suddenly require more CPU? - https://phabricator.wikimedia.org/T181621#3796649 (akosiaris) https://gerrit.wikimedia.org/r/394060
[15:12:48] godog: herron not knowing anything I imagine it's that the server has environment = future and the client does not and that trips on line 4 of site.pp
[15:12:56] Operations, Scoring-platform-team, Wikimedia-Incident: What is causing ORES celery workers to suddenly require more CPU? - https://phabricator.wikimedia.org/T181621#3796651 (akosiaris) And yes let's send these logs to logstash!!!
[15:14:10] FYI, ORES is fine again
[15:14:12] chasemp: intriguing theory
[15:15:14] (CR) Elukey: "@Ottomata: better this time? (pcc https://puppet-compiler.wmflabs.org/compiler02/9062/)" [puppet] - https://gerrit.wikimedia.org/r/393756 (https://phabricator.wikimedia.org/T167790) (owner: Elukey)
[15:15:29] PROBLEM - Restbase root url on restbase1012 is CRITICAL: connect to address 10.64.32.79 and port 7231: Connection refused
[15:15:29] PROBLEM - configured eth on restbase1012 is CRITICAL: Return code of 255 is out of bounds
[15:16:34] that's me ^ silencing
[15:16:57] godog: if you add environment = future to [main] in puppet.conf I bet it runs clean. Not sure if it's actually desired state but it's a test :)
[15:17:44] chasemp: thanks! yeah herron was taking a look too so cc ^
[15:18:16] I'll hold off for now, not overly urgent to get puppet running there
[15:23:25] (CR) Ottomata: [C: 1] profile::hadoop::common: import hiera config from cdh::hadoop (1 comment) [puppet] - https://gerrit.wikimedia.org/r/393756 (https://phabricator.wikimedia.org/T167790) (owner: Elukey)
[15:26:29] chasemp: you are indeed correct fine sir
[15:29:28] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 23 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[15:30:00] (CR) Elukey: profile::hadoop::common: import hiera config from cdh::hadoop (1 comment) [puppet] - https://gerrit.wikimedia.org/r/393756 (https://phabricator.wikimedia.org/T167790) (owner: Elukey)
[15:32:40] (CR) Alexandros Kosiaris: [C: 1] "Look great to me" [puppet] - https://gerrit.wikimedia.org/r/394025 (owner: Muehlenhoff)
[15:33:24] Operations, ops-eqiad: Possible memory errors on ganeti1005, ganeti1006 - https://phabricator.wikimedia.org/T181121#3796723 (Cmjohnson) a:Cmjohnson>akosiaris The h/w tests are finished and no errors were found. assigning to @akosiaris
[15:34:28] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 12 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[15:34:58] PROBLEM - Varnish HTTP text-backend - port 3128 on cp4027 is CRITICAL: connect to address 10.128.0.127 and port 3128: Connection refused
[15:35:58] RECOVERY - Varnish HTTP text-backend - port 3128 on cp4027 is OK: HTTP OK: HTTP/1.1 200 OK - 178 bytes in 0.157 second response time
[15:37:28] (CR) Herron: [C: -1] "bblack pointed out this would break cache_upload https://puppet-compiler.wmflabs.org/compiler02/9065/cp1099.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/394065 (https://phabricator.wikimedia.org/T179181) (owner: Herron)
[15:38:48] RECOVERY - configured eth on restbase1012 is OK: OK - interfaces up
[15:40:45] (CR) Ottomata: [C: 1] [WIP] profile::hadoop::common: add Prometheus JMX exporter configuration [puppet] - https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458) (owner: Elukey)
[15:40:51] (PS2) Jon Harald Søby: Add "Prosjekt" namespace to nowikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/394059 (https://phabricator.wikimedia.org/T181625)
[15:40:58] (PS3) Herron: puppet: update hiera function calls in varnish normalize_path [puppet] - https://gerrit.wikimedia.org/r/394065 (https://phabricator.wikimedia.org/T179181)
[15:42:36] Operations, monitoring, Scoring-platform-team (Current): Investigate scb1001 and scb1002 available memory graphs in Grafana - https://phabricator.wikimedia.org/T181544#3796764 (akosiaris) Now that I had some time to view those graphs (that is https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&fr...
[15:53:06] (PS1) Cmjohnson: Changing production dns for db1110 [dns] - https://gerrit.wikimedia.org/r/394073
[15:53:38] (CR) Cmjohnson: [C: 2] Changing production dns for db1110 [dns] - https://gerrit.wikimedia.org/r/394073 (owner: Cmjohnson)
[15:57:09] (PS1) Jcrespo: mariadb: Update db1110 ip [mediawiki-config] - https://gerrit.wikimedia.org/r/394075 (https://phabricator.wikimedia.org/T181613)
[16:09:32] (PS9) Arturo Borrero Gonzalez: apt: add --force-confold/--force-confdef dpkg option to apt calls [puppet] - https://gerrit.wikimedia.org/r/392421 (https://phabricator.wikimedia.org/T180811)
[16:14:18] Operations, Goal, Patch-For-Review: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3796866 (bd808)
[16:18:25] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29fullscreenorgId=1
[16:19:58] (CR) Elukey: [C: 2] profile::hadoop::common: import hiera config from cdh::hadoop [puppet] - https://gerrit.wikimedia.org/r/393756 (https://phabricator.wikimedia.org/T167790) (owner: Elukey)
[16:25:16] (CR) Jcrespo: [C: 2] mariadb: Update db1110 ip [mediawiki-config] - https://gerrit.wikimedia.org/r/394075 (https://phabricator.wikimedia.org/T181613) (owner: Jcrespo)
[16:26:41] (Merged) jenkins-bot: mariadb: Update db1110 ip [mediawiki-config] - https://gerrit.wikimedia.org/r/394075 (https://phabricator.wikimedia.org/T181613) (owner: Jcrespo)
[16:27:09] (CR) jenkins-bot: mariadb: Update db1110 ip [mediawiki-config] - https://gerrit.wikimedia.org/r/394075 (https://phabricator.wikimedia.org/T181613) (owner: Jcrespo)
[16:30:38] (CR) Ema: [C: 1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/394065 (https://phabricator.wikimedia.org/T179181) (owner: Herron)
[16:31:24] (PS4) Herron: puppet: update hiera function calls in varnish normalize_path [puppet] - https://gerrit.wikimedia.org/r/394065 (https://phabricator.wikimedia.org/T179181)
[16:32:24] (CR) Herron: [C: 2] puppet: update hiera function calls in varnish normalize_path [puppet] - https://gerrit.wikimedia.org/r/394065 (https://phabricator.wikimedia.org/T179181) (owner: Herron)
[16:32:52] (CR) Marostegui: [C: 2] s8.hosts: Add eqiad hosts [software] - https://gerrit.wikimedia.org/r/394057 (https://phabricator.wikimedia.org/T177208) (owner: Marostegui)
[16:33:44] (CR) Rush: [C: 1] apt: add --force-confold/--force-confdef dpkg option to apt calls [puppet] - https://gerrit.wikimedia.org/r/392421 (https://phabricator.wikimedia.org/T180811) (owner: Arturo Borrero Gonzalez)
[16:33:47] (Merged) jenkins-bot: s8.hosts: Add eqiad hosts [software] - https://gerrit.wikimedia.org/r/394057 (https://phabricator.wikimedia.org/T177208) (owner: Marostegui)
[16:33:51] (PS10) Rush: apt: add --force-confold/--force-confdef dpkg option to apt calls [puppet] - https://gerrit.wikimedia.org/r/392421 (https://phabricator.wikimedia.org/T180811) (owner: Arturo Borrero Gonzalez)
[16:34:53] paravoid: ^^^ ready for review again, thanks for your time :-)
[16:35:17] oh hi
[16:35:19] looking :)
[16:37:58] have you tested this?
[16:38:04] is the trailing :: from the option needed?
[16:39:38] arturo: ^ :)
[16:39:53] yes, I tested this
[16:40:03] ok, cool
[16:40:05] +1 :)
[16:40:17] thanks paravoid ! :-)
[16:40:25] yvw, anytime :)
[16:43:07] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Update db1110 ip (duration: 00m 49s)
[16:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:50] !log jynus@tin Synchronized wmf-config/db-codfw.php: Update db1110 ip (duration: 00m 50s)
[16:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:50] (PS1) Arturo Borrero Gonzalez: apt: unattended-upgrades: add comment about security upgrades [puppet] - https://gerrit.wikimedia.org/r/394080
[16:48:54] (CR) Rush: [C: 1] apt: unattended-upgrades: add comment about security upgrades [puppet] - https://gerrit.wikimedia.org/r/394080 (owner: Arturo Borrero Gonzalez)
[16:49:08] (CR) Arturo Borrero Gonzalez: [C: 2] apt: unattended-upgrades: add comment about security upgrades [puppet] - https://gerrit.wikimedia.org/r/394080 (owner: Arturo Borrero Gonzalez)
[16:55:22] (PS4) Giuseppe Lavagetto: site.pp: remove import of realm.pp [puppet] - https://gerrit.wikimedia.org/r/393799
[16:57:42] (PS1) Herron: puppet: point codfw cp servers at codfw puppet 4 masters [puppet] - https://gerrit.wikimedia.org/r/394084 (https://phabricator.wikimedia.org/T177254)
[17:06:22] (CR) Giuseppe Lavagetto: [C: 2] site.pp: remove import of realm.pp [puppet] - https://gerrit.wikimedia.org/r/393799 (owner: Giuseppe Lavagetto)
[17:07:13] PROBLEM - puppet last run on mw2251 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/furl]
[17:07:54] PROBLEM - puppet last run on mw2253 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/furl]
[17:11:17] (PS1) Herron: puppet: initial cut over of codfw cp text/upload to puppet 4 masters [puppet] - https://gerrit.wikimedia.org/r/394085 (https://phabricator.wikimedia.org/T177254)
[17:14:35] (PS3) Giuseppe Lavagetto: environment/future: remove redundant settings [puppet] - https://gerrit.wikimedia.org/r/394042
[17:15:30] (CR) Giuseppe Lavagetto: [C: 2] environment/future: remove redundant settings [puppet] - https://gerrit.wikimedia.org/r/394042 (owner: Giuseppe Lavagetto)
[17:16:58] (PS2) Marostegui: filtered_tables: Add new columns [puppet] - https://gerrit.wikimedia.org/r/393725 (https://phabricator.wikimedia.org/T174569)
[17:17:13] (CR) Marostegui: "I think I will merge this for now then and we can keep working on it when needed" [puppet] - https://gerrit.wikimedia.org/r/393725 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[17:17:23] (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler02/9074/" [puppet] - https://gerrit.wikimedia.org/r/394085 (https://phabricator.wikimedia.org/T177254) (owner: Herron)
[17:17:31] (PS2) Herron: puppet: initial cut over of codfw cp text/upload to puppet 4 masters [puppet] - https://gerrit.wikimedia.org/r/394085 (https://phabricator.wikimedia.org/T177254)
[17:19:51] (PS1) Jcrespo: mariadb: Pool db1110 with low load after ip change and reboot [mediawiki-config] - https://gerrit.wikimedia.org/r/394086 (https://phabricator.wikimedia.org/T181613)
[17:19:53] !log beginning cut over of cp200[12] to codfw puppet 4 masters
[17:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:25] (CR) Herron: [C: 2] puppet: initial cut over of codfw cp text/upload to puppet 4 masters [puppet] - https://gerrit.wikimedia.org/r/394085 (https://phabricator.wikimedia.org/T177254) (owner: Herron)
[17:22:17] (CR) Jcrespo: [C: 2] mariadb: Pool db1110 with low load after ip change and reboot [mediawiki-config] - https://gerrit.wikimedia.org/r/394086 (https://phabricator.wikimedia.org/T181613) (owner: Jcrespo)
[17:23:43] (Merged) jenkins-bot: mariadb: Pool db1110 with low load after ip change and reboot [mediawiki-config] - https://gerrit.wikimedia.org/r/394086 (https://phabricator.wikimedia.org/T181613) (owner: Jcrespo)
[17:27:02] (CR) jenkins-bot: mariadb: Pool db1110 with low load after ip change and reboot [mediawiki-config] - https://gerrit.wikimedia.org/r/394086 (https://phabricator.wikimedia.org/T181613) (owner: Jcrespo)
[17:29:15] (PS10) Elukey: [WIP] profile::hadoop::common: add Prometheus JMX exporter configuration [puppet] - https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458)
[17:32:54] RECOVERY - puppet last run on mw2253 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[17:34:57] (PS1) Chad: group1 to wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/394089
[17:34:59] (CR) Chad: [C: -2] group1 to wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/394089 (owner: Chad)
[17:37:13] RECOVERY - puppet last run on mw2251 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:37:26] (PS11) Elukey: profile::hadoop::common: add Prometheus JMX exporter configuration [puppet] - https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458)
[17:38:04] (CR) Elukey: "Looks good from pcc's perspective: https://puppet-compiler.wmflabs.org/compiler02/9075/" [puppet] - https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458) (owner: Elukey)
[17:38:39] (PS12) Elukey: profile::hadoop::worker: add Prometheus JMX exporter configuration [puppet] - https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458)
[17:40:15] !log bootstrapping restbase1012-a - T179422
[17:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:22] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422
[17:40:26] Operations, ORES, Release-Engineering-Team, Scoring-platform-team (Current), and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3797253 (awight) @thcipriani OK thank you for the workaround. I'll note that I don't have permissions to do that mysel...
[17:40:32] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool db1110 with low load (duration: 00m 48s)
[17:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:38] (PS1) Paladox: Revert "environment/future: remove redundant settings" [puppet] - https://gerrit.wikimedia.org/r/394095
[17:40:50] (PS2) Paladox: Revert "environment/future: remove redundant settings" [puppet] - https://gerrit.wikimedia.org/r/394095
[17:40:52] (CR) Filippo Giunchedi: [C: 1] profile::hadoop::worker: add Prometheus JMX exporter configuration [puppet] - https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458) (owner: Elukey)
[17:44:28] (PS5) Smalyshev: Create script for automatic reload of categories [puppet] - https://gerrit.wikimedia.org/r/392736 (https://phabricator.wikimedia.org/T173772)
[17:45:10] (CR) jerkins-bot: [V: -1] Create script for automatic reload of categories [puppet] - https://gerrit.wikimedia.org/r/392736 (https://phabricator.wikimedia.org/T173772) (owner: Smalyshev)
[17:47:23] (PS13) Elukey: profile::hadoop::worker: add Prometheus JMX exporter configuration [puppet] - https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458)
[17:48:49] (CR) Smalyshev: wdqs: schedule cronjob to reload categories (4 comments) [puppet] - https://gerrit.wikimedia.org/r/394021 (https://phabricator.wikimedia.org/T173772) (owner: Gehel)
[17:49:18] (CR) Elukey: "Just corrected a little thing, the environment files were not getting updated." [puppet] - https://gerrit.wikimedia.org/r/394045 (https://phabricator.wikimedia.org/T177458) (owner: Elukey)
[17:50:14] (PS6) Smalyshev: Create script for automatic reload of categories [puppet] - https://gerrit.wikimedia.org/r/392736 (https://phabricator.wikimedia.org/T173772)
[17:50:59] (CR) Filippo Giunchedi: [C: 1] prometheus: add varnish-canary job definition [puppet] - https://gerrit.wikimedia.org/r/394063 (owner: Ema)
[17:52:28] (PS2) Herron: puppet: point codfw cp servers at codfw puppet 4 masters [puppet] - https://gerrit.wikimedia.org/r/394084 (https://phabricator.wikimedia.org/T177254)
[17:54:01] (CR) Andrew Bogott: [C: 1] "I can confirm that this breaks puppet runs on e.g. clients of tools-puppetmaster-01. Probably has a prerequisite patch that doesn't prope" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/394095 (owner: Paladox)
[17:54:35] we're getting a ton of puppet failures in beta cluster, can someone take a quick looksee?
[17:54:52] started at 17:24
[17:56:04] (Draft1) Paladox: role::ci::slave::browsertests: Fix $redis_port by adding string [puppet] - https://gerrit.wikimedia.org/r/394096
[17:56:07] (PS2) Paladox: role::ci::slave::browsertests: Fix $redis_port by adding string [puppet] - https://gerrit.wikimedia.org/r/394096
[17:56:25] greg-g see -cloud
[17:57:19] (PS3) Herron: puppet: point codfw cp servers at codfw puppet 4 masters [puppet] - https://gerrit.wikimedia.org/r/394084 (https://phabricator.wikimedia.org/T177254)
[17:58:00] (PS3) Smalyshev: Enable wdqs-admins to restart nginx [puppet] - https://gerrit.wikimedia.org/r/393814 (https://phabricator.wikimedia.org/T181540)
[17:59:04] (CR) Herron: [C: 2] puppet: point codfw cp servers at codfw puppet 4 masters [puppet] - https://gerrit.wikimedia.org/r/394084 (https://phabricator.wikimedia.org/T177254) (owner: Herron)
[18:02:12] (PS1) Jcrespo: mariadb: Increase db1110 load [mediawiki-config] - https://gerrit.wikimedia.org/r/394097 (https://phabricator.wikimedia.org/T181613)
[18:03:10] (PS1) Andrew Bogott: puppetmaster::standalone: Get env_config ready for puppet 4 and cleanups [puppet] - https://gerrit.wikimedia.org/r/394098
[18:03:29] !log beginning cut over of codfw cp2* servers to codfw puppet 4 masters
[18:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:04] !log awight@tin Started deploy [ores/deploy@532bd0b]: (non-production) Update ORES on new cluster
[18:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:16] !log awight@tin Finished deploy [ores/deploy@532bd0b]: (non-production) Update ORES on new cluster (duration: 01m 11s)
[18:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:03] (CR) Jcrespo: [C: 2] mariadb: Increase db1110 load [mediawiki-config] - https://gerrit.wikimedia.org/r/394097 (https://phabricator.wikimedia.org/T181613) (owner: Jcrespo)
[18:07:07] (CR) Filippo Giunchedi: [C: 1] Add Prometheus exporter to openldap/labs [puppet] - https://gerrit.wikimedia.org/r/394025 (owner: Muehlenhoff)
[18:07:53] (CR) EBernhardson: "recheck" [puppet] - https://gerrit.wikimedia.org/r/394002 (owner: EBernhardson)
[18:08:01] (Merged) jenkins-bot: mariadb: Increase db1110 load [mediawiki-config] - https://gerrit.wikimedia.org/r/394097 (https://phabricator.wikimedia.org/T181613) (owner: Jcrespo)
[18:08:11] (CR) jenkins-bot: mariadb: Increase db1110 load [mediawiki-config] - https://gerrit.wikimedia.org/r/394097 (https://phabricator.wikimedia.org/T181613) (owner: Jcrespo)
[18:08:18] (CR) jerkins-bot: [V: -1] Revert "Revert "Deploy MjoLniR with new deploy repository"" [puppet] - https://gerrit.wikimedia.org/r/394002 (owner: EBernhardson)
[18:09:54] Operations, ORES, Release-Engineering-Team, Scoring-platform-team (Current): Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3797349 (awight)
[18:09:59] Operations, ORES, Release-Engineering-Team, Scoring-platform-team (Current): Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3797349 (awight) a:awight>None
[18:10:04] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Increase db1110 load (duration: 00m 53s)
[18:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:32] (PS2) EBernhardson: Revert "Revert "Deploy MjoLniR with new deploy repository"" [puppet] - https://gerrit.wikimedia.org/r/394002
[18:12:56] !log Compress s5 on db1096 - T178359
[18:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:04] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[18:14:05] Operations, ORES, Release-Engineering-Team, Scoring-platform-team (Current), and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3797381 (awight) a:awight>None
[18:14:13] (PS2) Andrew Bogott: puppetmaster::standalone: Get env_config ready for puppet 4 and cleanups [puppet] - https://gerrit.wikimedia.org/r/394098
[18:16:20] (PS3) Andrew Bogott: puppetmaster::standalone: Get env_config ready for puppet 4 and cleanups [puppet] - https://gerrit.wikimedia.org/r/394098
[18:16:38] (CR) Paladox: [C: 1] puppetmaster::standalone: Get env_config ready for puppet 4 and cleanups [puppet] - https://gerrit.wikimedia.org/r/394098 (owner: Andrew Bogott)
[18:17:03] (PS4) Andrew Bogott: puppetmaster::standalone: Get env_config ready for puppet 4 and cleanups [puppet] - https://gerrit.wikimedia.org/r/394098
[18:17:37] (CR) Dzahn: [C: 1] "thx, lgtm, just needs meeting approval like Gehel said" [puppet] - https://gerrit.wikimedia.org/r/393814 (https://phabricator.wikimedia.org/T181540) (owner: Smalyshev)
[18:17:52] (CR) Andrew Bogott: [C: 2] puppetmaster::standalone: Get env_config ready for puppet 4 and cleanups [puppet] - https://gerrit.wikimedia.org/r/394098 (owner: Andrew Bogott)
[18:19:57] Operations, Ops-Access-Requests, Discovery, Wikidata, and 2 others: Enable wdqs-admin's to control nginx - https://phabricator.wikimedia.org/T181540#3797408 (Dzahn) +1 and patch looks good to me now (I encouraged using systemctl instead/besides the service command). It will need approval in nex...
[18:20:40] Operations, Ops-Access-Requests, Discovery, Wikidata, and 2 others: Enable wdqs-admin's to control nginx - https://phabricator.wikimedia.org/T181540#3797409 (Dzahn) Open>stalled
[18:21:09] (Abandoned) Andrew Bogott: Revert "environment/future: remove redundant settings" [puppet] - https://gerrit.wikimedia.org/r/394095 (owner: Paladox)
[18:21:44] (CR) Jcrespo: [C: 1] "I couldn't do it, something came by, I will try tomorrow." [puppet] - https://gerrit.wikimedia.org/r/375347 (https://phabricator.wikimedia.org/T113842) (owner: Tpt)
[18:22:40] (CR) Dzahn: [C: -1] "pending ops meeting approval next Monday" [puppet] - https://gerrit.wikimedia.org/r/393988 (https://phabricator.wikimedia.org/T181479) (owner: Dzahn)
[18:24:02] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to terbium/wasat for Trey Jones - https://phabricator.wikimedia.org/T181479#3797413 (Dzahn) Open>stalled This ticket is now just pending approval in the weekly ops meeting which is next Monday.
[18:24:33] <_joe_> andrewbogott: well done :)
[18:24:40] <_joe_> sorry I was in a meeting until now
[18:29:43] (CR) Dzahn: "shell access should be based on roles, let's not "hard code" host names, right @joe" [puppet] - https://gerrit.wikimedia.org/r/393994 (owner: Dzahn)
[18:30:46] (PS1) Arturo Borrero Gonzalez: role::puppetmaster::standalone: add ferm rules to allow connecting to tcp/8140 [puppet] - https://gerrit.wikimedia.org/r/394101 (https://phabricator.wikimedia.org/T154150)
[18:31:06] (CR) jerkins-bot: [V: -1] role::puppetmaster::standalone: add ferm rules to allow connecting to tcp/8140 [puppet] - https://gerrit.wikimedia.org/r/394101 (https://phabricator.wikimedia.org/T154150) (owner: Arturo Borrero Gonzalez)
[18:32:45] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to terbium/wasat for Trey Jones - https://phabricator.wikimedia.org/T181479#3797436 (TJones) @Dzhan, thanks for getting this moving so quickly! Happy to wait for Monday's ops meeting.
[18:34:58] (03PS2) 10Arturo Borrero Gonzalez: role::puppetmaster::standalone: add ferm rules to allow connecting to tcp/8140 [puppet] - 10https://gerrit.wikimedia.org/r/394101 (https://phabricator.wikimedia.org/T154150) [18:35:19] (03CR) 10jerkins-bot: [V: 04-1] role::puppetmaster::standalone: add ferm rules to allow connecting to tcp/8140 [puppet] - 10https://gerrit.wikimedia.org/r/394101 (https://phabricator.wikimedia.org/T154150) (owner: 10Arturo Borrero Gonzalez) [18:37:05] (03PS3) 10Arturo Borrero Gonzalez: role::puppetmaster::standalone: add ferm rules to allow connecting to tcp/8140 [puppet] - 10https://gerrit.wikimedia.org/r/394101 (https://phabricator.wikimedia.org/T154150) [18:37:21] ^^^ me fighting with indentations [18:37:27] (03CR) 10jerkins-bot: [V: 04-1] role::puppetmaster::standalone: add ferm rules to allow connecting to tcp/8140 [puppet] - 10https://gerrit.wikimedia.org/r/394101 (https://phabricator.wikimedia.org/T154150) (owner: 10Arturo Borrero Gonzalez) [18:47:53] (03PS1) 10Dzahn: admins: new group wikidata-admins, add on canaries,maint [puppet] - 10https://gerrit.wikimedia.org/r/394102 (https://phabricator.wikimedia.org/T179317) [18:48:03] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team (Current): Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3797349 (10mmodell) This is very strange. I can't tell exactly what would be causing this. [18:49:04] arturo: that error is strange, like it's not caused by you .. [18:49:39] you are just adding a ferm::service to a role, would be common, but "modules/role/manifests/puppetmaster/standalone.pp:1 wmf-style: role 'role::puppetmaster::standalone' should not include defines" eh... 
[18:50:17] mutante: yeah [18:51:56] (03PS1) 10MaxSem: Switch all wikis to HTML5 section IDs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394104 (https://phabricator.wikimedia.org/T152540) [18:52:23] PROBLEM - Disk space on graphite2002 is CRITICAL: DISK CRITICAL - free space: /var/lib/carbon 82856 MB (3% inode=97%) [18:52:45] arturo: ah, right, it should be in the profile now [18:53:18] but puppetmaster isn't converted to profiles yet [18:53:23] (03CR) 10Rush: [C: 031] "I think is probably an override for jenkins candidate unfortunately. My eyeball says it should be fine." [puppet] - 10https://gerrit.wikimedia.org/r/394101 (https://phabricator.wikimedia.org/T154150) (owner: 10Arturo Borrero Gonzalez) [18:53:48] mutante, (CC chasemp) yes, we are aware of that [18:54:01] what chasemp said, this is probably a candidate to override it [18:54:42] (03PS2) 10Herron: puppet: point codfw lvs servers at codfw puppet 4 masters [puppet] - 10https://gerrit.wikimedia.org/r/393670 (https://phabricator.wikimedia.org/T177254) [18:54:43] great [18:56:16] oh, one sec, checking, there ARE already profiles [18:57:41] !log beginning cutover of codfw lvs2* systems to codfw puppet 4 masters. first standby nodes, then active nodes [18:57:41] for prod puppetmaster, but not puppetmaster::standalone.. gotcha [18:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:22] (03CR) 10Herron: [C: 032] puppet: point codfw lvs servers at codfw puppet 4 masters [puppet] - 10https://gerrit.wikimedia.org/r/393670 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [18:58:56] mutante: yes [18:59:21] we should make role::puppetmaster::standalone use a newly created modules/profile/manifests/puppetmaster/standalone.pp (next to existing backend.pp and frontend.pp).
that would be the real fix, but that shouldn't block this one [19:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Morning SWAT (Max 8 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171129T1900). [19:00:04] kaldari and Jhs: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:04:05] I'm around! [19:04:35] I'll deploy [19:04:38] o/ [19:05:47] (03PS3) 10MaxSem: Add "Prosjekt" namespace to nowikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394059 (https://phabricator.wikimedia.org/T181625) (owner: 10Jon Harald Søby) [19:05:51] (03CR) 10MaxSem: [C: 032] Add "Prosjekt" namespace to nowikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394059 (https://phabricator.wikimedia.org/T181625) (owner: 10Jon Harald Søby) [19:06:15] _joe_: https://gerrit.wikimedia.org/r/#/c/393799/ seems to have broken puppet in deployment-prep where it's looking for nameservers definition that came from realm I suspect? thcipriani andrewbogott [19:06:27] kaldari's patch isn't going live today btw [19:06:34] <_joe_> chasemp: andrewbogott made a patch for that [19:06:45] ok, so it's just me then, MaxSem ? [19:06:46] <_joe_> chasemp: I guess you just need to merge it? [19:06:50] maybe it just hasn't landed there yet? [19:06:55] yup Jhs [19:06:57] hoo: when it says "varnish logs" in the access request, does it translate to the actual "varnishlog" command to see backend health or to "varnishncsa" to see request logs? [19:07:05] <_joe_> sorry, going off now [19:07:34] could be a chicken and egg thing as the master itself [19:07:36] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Variable $::nameservers is not defined!
at /etc/puppet/modules/base/manifests/resolving.pp:6 on node deployment-puppetmaster02.deployment-prep.eqiad.wmflabs [19:07:36] Warning: Not using cache on failed catalog [19:07:37] Error: Could not retrieve catalog; skipping run [19:07:55] (03Merged) 10jenkins-bot: Add "Prosjekt" namespace to nowikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394059 (https://phabricator.wikimedia.org/T181625) (owner: 10Jon Harald Søby) [19:08:00] MaxSem, ok. i've only done this once before, so i might need some hand-holding [19:08:39] thcipriani: this is a mess and I'm unsure atm why Toolforge is ok and deployment-prep is broken [19:09:03] andrew was transitioning work spaces so if we can stand it let's wait to ask him before making possibly unneeded changes [19:09:12] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10Performance-Team (Radar): Varnish and Apache root for hoo - https://phabricator.wikimedia.org/T179317#3720398 (10Dzahn) >>! In T179317#3788865, @ArielGlenn wrote: > - varnish log access on the varnish hosts Does this translate to the actual "varn... [19:09:17] rescheduled my deployment for tomorrow [19:09:45] chasemp: k, I just stumbled into some shinken alerts, so I'm fine waiting for andrew [19:10:20] thcipriani: I have a suspicion andrewbogott had to hand fix the tools puppetmaster to get over the chicken-egg-puppet-fixes-puppet issue [19:11:14] thcipriani: https://gerrit.wikimedia.org/r/#/c/394098/4/modules/role/manifests/puppetmaster/standalone.pp [19:12:03] (03PS7) 10Smalyshev: Create script for automatic reload of categories [puppet] - 10https://gerrit.wikimedia.org/r/392736 (https://phabricator.wikimedia.org/T173772) [19:12:29] Jhs: your patch is live on mwdebug1002, please test [19:12:48] <_joe_> chasemp: yeah you prolly need to fix manually the puppetmaster. [19:12:56] thcipriani: well uh puppet checkout on the master there seems pretty old?
[19:12:57] <_joe_> I can do it later, now I'm at dinner, sorry [19:13:19] _joe_: yeah this is not an emergency at all, we'll circle back if it's an actual mystery rather than a mess [19:13:24] go enjoy dinner [19:13:52] MaxSem, everything seems to be working correctly. haven't tried save, but i can't imagine why that would fail [19:14:11] * thcipriani fiddles [19:14:23] (03CR) 10jenkins-bot: Add "Prosjekt" namespace to nowikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394059 (https://phabricator.wikimedia.org/T181625) (owner: 10Jon Harald Søby) [19:14:57] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10Performance-Team (Radar): Varnish and Apache root for hoo - https://phabricator.wikimedia.org/T179317#3797677 (10Dzahn) >>! In T179317#3788865, @ArielGlenn wrote: > on the mw canaries role::mediawiki::canary_appserver (https://gerrit.wikimedia.org... [19:15:21] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/394059/3 (duration: 00m 49s) [19:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:41] (03PS8) 10Smalyshev: Create script for automatic reload of categories [puppet] - 10https://gerrit.wikimedia.org/r/392736 (https://phabricator.wikimedia.org/T173772) [19:16:05] Jhs: ^ [19:16:13] mutante: varnishncsa + varnishlog, I suppose [19:16:45] hoo: ah, both. ok! i am creating a new group [19:16:49] (03PS9) 10Smalyshev: Create script for automatic reload of categories [puppet] - 10https://gerrit.wikimedia.org/r/392736 (https://phabricator.wikimedia.org/T173772) [19:17:24] figuring out privileges [19:17:54] MaxSem, should it be live now? [19:18:03] yes [19:18:39] ah, yes, there i see it. 
confirmed :) [19:19:25] wee [19:21:49] thanks MaxSem :) [19:24:04] ( _joe_ hopefully enjoying dinner just fyi all is well in deployment-prep ) [19:28:44] (03PS2) 10Dzahn: admins: new group wikidata-admins, add on canaries,maint [puppet] - 10https://gerrit.wikimedia.org/r/394102 (https://phabricator.wikimedia.org/T179317) [19:28:55] (03PS1) 10Herron: dns: restore puppet.codfw.wmnet CNAME puppetmaster2001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/394110 (https://phabricator.wikimedia.org/T177254) [19:29:17] (03CR) 10jerkins-bot: [V: 04-1] dns: restore puppet.codfw.wmnet CNAME puppetmaster2001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/394110 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [19:30:12] (03PS2) 10Herron: dns: restore puppet.codfw.wmnet CNAME puppetmaster2001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/394110 (https://phabricator.wikimedia.org/T177254) [19:32:29] (03PS3) 10Herron: dns: restore puppet.codfw.wmnet CNAME puppetmaster2001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/394110 (https://phabricator.wikimedia.org/T177254) [19:33:08] (03PS4) 10Herron: dns: restore puppet.codfw.wmnet CNAME puppetmaster2001.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/394110 (https://phabricator.wikimedia.org/T177254) [19:34:37] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10Performance-Team (Radar): Varnish and Apache root for hoo - https://phabricator.wikimedia.org/T179317#3797732 (10Dzahn) >>! In T179317#3788865, @ArielGlenn wrote: > - strace and tcpdump 'ALL = NOPASSWD: /usr/bin/strace *', 'ALL = NOPASSWD: /usr/s... [19:36:44] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10Performance-Team (Radar): Varnish and Apache root for hoo - https://phabricator.wikimedia.org/T179317#3797736 (10Dzahn) a:03Dzahn [19:36:54] (03CR) 10Chad: [C: 031] "Let's land this." 
[puppet] - 10https://gerrit.wikimedia.org/r/391865 (https://phabricator.wikimedia.org/T171758) (owner: 10Paladox) [19:37:49] (03CR) 10Chad: "Is there a reason we don't have www.mediawiki.org here?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392999 (https://phabricator.wikimedia.org/T117302) (owner: 10TerraCodes) [19:42:17] (03PS6) 10Dzahn: Gerrit: Set auth.gitBasicAuth to true [puppet] - 10https://gerrit.wikimedia.org/r/391865 (https://phabricator.wikimedia.org/T171758) (owner: 10Paladox) [19:43:02] (03PS1) 10Herron: puppet: cut over eqiad scb hosts to codfw puppet 4 masters [puppet] - 10https://gerrit.wikimedia.org/r/394119 (https://phabricator.wikimedia.org/T177254) [19:49:32] (03CR) 10Dzahn: "talked about it one more time, yea, so this will mean if people want to upload via https they HAVE to use the random http password and not" [puppet] - 10https://gerrit.wikimedia.org/r/391865 (https://phabricator.wikimedia.org/T171758) (owner: 10Paladox) [19:49:57] (03CR) 10Dzahn: [C: 032] Gerrit: Set auth.gitBasicAuth to true [puppet] - 10https://gerrit.wikimedia.org/r/391865 (https://phabricator.wikimedia.org/T171758) (owner: 10Paladox) [19:50:04] thanks mutante :) [19:50:13] (03CR) 10Dzahn: [C: 032] "to unblock LFS support which other teams are waiting for" [puppet] - 10https://gerrit.wikimedia.org/r/391865 (https://phabricator.wikimedia.org/T171758) (owner: 10Paladox) [19:50:22] requires a gerrit restart for it to take effect [19:50:57] !log beginning rolling cut over of eqiad scb hosts to codfw puppet 4 masters [19:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:17] (03CR) 10Herron: [C: 032] puppet: cut over eqiad scb hosts to codfw puppet 4 masters [puppet] - 10https://gerrit.wikimedia.org/r/394119 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [19:52:23] (03PS2) 10Herron: puppet: cut over eqiad scb hosts to codfw puppet 4 masters [puppet] - 10https://gerrit.wikimedia.org/r/394119 
(https://phabricator.wikimedia.org/T177254) [19:53:37] herron: i'd like to restart gerrit but i can wait until you have something merged. [19:54:06] mutante all good here! [19:54:34] ok, cool :) [19:55:12] !log restarting gerrit to apply config change and set gitBasicAuth to true to unblock T171758 (gerrit:391865) [19:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:22] T171758: Support git-lfs files in gerrit - https://phabricator.wikimedia.org/T171758 [19:55:47] grr [19:55:51] bad mutante :P [19:56:12] sorry Reedy, i need a "wall" notification across all channels [19:56:18] :D [19:56:25] back [19:56:56] mutante: isnt there a cmd to do that like asay? [19:57:14] no_justification: ^ deployed .. the git-lfs stuff might work now [19:57:14] if he was netops, he could tell everyone on the network [19:57:17] And piss many people off [19:57:19] no_justification i think lfs may work now [19:57:26] :) [19:57:34] ~/gerrit-ping$ git lfs push origin master [19:57:34] Git LFS: (1 of 1 files) 42 B / 42 B [19:57:45] Zppix: maybe wm-bot would be the right thing for that :) [19:57:54] wm-bot: spam --all [19:58:02] though i doint see the change [19:58:16] well, i can confirm the config changed on cobalt [19:59:23] paladox: 42 byte too small for "large" file support ? :P j/k [19:59:33] i was testing on a php file [19:59:40] i am going to test on a bat file as a test [19:59:45] https://github.com/git-lfs/git-lfs/wiki/Tutorial [20:00:05] no_justification: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171129T2000). [20:00:05] No GERRIT patches in the queue for this window AFAICS. 
[20:00:30] yep, No GERRIT patches, so we patched Gerrit instead [20:05:41] !log restarting cassandra bootstrap of restbase1012-a (T179422) [20:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:49] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [20:05:53] woo hoo [20:05:58] no_justification mutante it works [20:06:03] see https://gerrit.wikimedia.org/r/#/c/394125/2/cat.bin [20:09:56] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 3 others: Support git-lfs files in gerrit - https://phabricator.wikimedia.org/T171758#3797904 (10Paladox) git-lfs is now supported :). See https://gerrit.wikimedia.org/r/#/c/394125/ [20:11:26] (03CR) 10Dzahn: Bird: add monitoring to the VIP and bird process (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393668 (https://phabricator.wikimedia.org/T98006) (owner: 10Ayounsi) [20:14:07] paladox: :) [20:14:15] awight: ^ git-lfs, see paladox' link [20:14:21] :) [20:20:58] (03PS2) 10Smalyshev: wdqs: schedule cronjob to reload categories [puppet] - 10https://gerrit.wikimedia.org/r/394021 (https://phabricator.wikimedia.org/T173772) (owner: 10Gehel) [20:21:25] (03CR) 10jerkins-bot: [V: 04-1] wdqs: schedule cronjob to reload categories [puppet] - 10https://gerrit.wikimedia.org/r/394021 (https://phabricator.wikimedia.org/T173772) (owner: 10Gehel) [20:21:30] (03CR) 10Chad: [C: 032] group1 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394089 (owner: 10Chad) [20:23:34] (03Merged) 10jenkins-bot: group1 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394089 (owner: 10Chad) [20:23:50] (03CR) 10jenkins-bot: group1 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394089 (owner: 10Chad) [20:26:49] !log demon@tin Synchronized php: symlink bump (duration: 00m 48s) [20:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:42] !log demon@tin rebuilt wikiversions.php and 
synchronized wikiversions files: group1 to wmf.10 [20:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:27] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team, 10Release-Engineering-Team (Watching / External): ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619#3797980 (10demon) [20:31:33] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 3 others: Support git-lfs files in gerrit - https://phabricator.wikimedia.org/T171758#3797978 (10demon) 05Open>03Resolved a:03demon [20:35:44] (03PS10) 10Gehel: Create script for automatic reload of categories [puppet] - 10https://gerrit.wikimedia.org/r/392736 (https://phabricator.wikimedia.org/T173772) (owner: 10Smalyshev) [20:36:37] (03CR) 10Gehel: [C: 032] Create script for automatic reload of categories [puppet] - 10https://gerrit.wikimedia.org/r/392736 (https://phabricator.wikimedia.org/T173772) (owner: 10Smalyshev) [20:37:07] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 2 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798012 (10awight) [20:39:57] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 2 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798028 (10awight) [20:40:33] (03PS1) 10Smalyshev: Fix reloadCategories.sh - nginx needs semicolon in maps [puppet] - 10https://gerrit.wikimedia.org/r/394134 [20:41:38] (03CR) 10Gehel: [C: 032] Fix reloadCategories.sh - nginx needs semicolon in maps [puppet] - 10https://gerrit.wikimedia.org/r/394134 (owner: 10Smalyshev) [20:43:03] 10Operations, 10Ops-Access-Requests: Requesting access to terbium.eqiad.wmnet for cparle - https://phabricator.wikimedia.org/T181626#3796459 (10Dzahn) Confirmed that L3 has already been signed by cparle. A user exists in admin module but in the "ldap_only_users" section. Adding real shell access means having... 
[20:44:33] (03PS1) 10Marostegui: s5.hosts: Update port for db1096 [software] - 10https://gerrit.wikimedia.org/r/394135 [20:44:59] <_joe_> awight: /win 24 [20:45:02] <_joe_> argh [20:45:05] <_joe_> nevermind :P [20:45:23] _joe_: You run 24-bit windows lol [20:45:51] (03CR) 10Marostegui: [C: 032] s5.hosts: Update port for db1096 [software] - 10https://gerrit.wikimedia.org/r/394135 (owner: 10Marostegui) [20:46:07] <_joe_> ahah no, I did mistype "w" as "aw" and autocompletion did the rest :P [20:46:36] (03Merged) 10jenkins-bot: s5.hosts: Update port for db1096 [software] - 10https://gerrit.wikimedia.org/r/394135 (owner: 10Marostegui) [20:46:37] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 3 others: Support git-lfs files in gerrit - https://phabricator.wikimedia.org/T171758#3798085 (10awight) [20:46:43] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team, 10Release-Engineering-Team (Watching / External): ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619#3798086 (10awight) [20:46:47] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 2 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798084 (10awight) [20:46:55] (03PS1) 10Smalyshev: Fix gui-vars location [puppet] - 10https://gerrit.wikimedia.org/r/394139 [20:47:40] 10Operations, 10Ops-Access-Requests: Requesting access to terbium.eqiad.wmnet for cparle - https://phabricator.wikimedia.org/T181626#3798091 (10Dzahn) @Gilles @Cparle Or would it make sense to just add the maintenance command to a cronjob and let it run automatically at fixed intervals? Does it really need to... 
[20:48:27] (03CR) 10Gehel: [C: 032] Fix gui-vars location [puppet] - 10https://gerrit.wikimedia.org/r/394139 (owner: 10Smalyshev) [20:49:26] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team, 10Release-Engineering-Team (Watching / External): ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619#3798097 (10Paladox) 05stalled>03Open [20:50:53] PROBLEM - puppet last run on wdqs2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:51:36] (03PS1) 10Smalyshev: Fix another place where gui-vars location was wrong [puppet] - 10https://gerrit.wikimedia.org/r/394141 [20:51:57] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 2 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798104 (10awight) I'm guessing we want to do something like, # Copy repos to a read-only location. # Set LFS flags and metadata on repo (unknown) # git... [20:52:19] (03CR) 10Gehel: [C: 032] Fix another place where gui-vars location was wrong [puppet] - 10https://gerrit.wikimedia.org/r/394141 (owner: 10Smalyshev) [20:52:33] PROBLEM - puppet last run on wdqs1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:55:53] RECOVERY - puppet last run on wdqs2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:56:04] PROBLEM - puppet last run on wdqs1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:57:23] (03PS3) 10Smalyshev: wdqs: schedule cronjob to reload categories [puppet] - 10https://gerrit.wikimedia.org/r/394021 (https://phabricator.wikimedia.org/T173772) (owner: 10Gehel) [20:57:45] (03CR) 10jerkins-bot: [V: 04-1] wdqs: schedule cronjob to reload categories [puppet] - 10https://gerrit.wikimedia.org/r/394021 (https://phabricator.wikimedia.org/T173772) (owner: 10Gehel) [20:59:09] (03PS3) 10Ayounsi: Bird: add monitoring to the VIP and bird process [puppet] - 10https://gerrit.wikimedia.org/r/393668 (https://phabricator.wikimedia.org/T98006) [20:59:59] (03PS4) 10Andrew Bogott: nova-network dnsmasq: set a deployment-appropriate cname for 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/393841 (https://phabricator.wikimedia.org/T181375) [21:00:01] (03PS6) 10Andrew Bogott: labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/393842 (https://phabricator.wikimedia.org/T181375) [21:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171129T2100). [21:00:04] No GERRIT patches in the queue for this window AFAICS. [21:00:55] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team (Current): Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3798137 (10thcipriani) Hrm. I think this error probably has something to do with ssh client timeout. I'm not sure if anything rece... 
[21:01:03] RECOVERY - puppet last run on wdqs1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:01:47] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 2 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798142 (10demon) They'll mirror just fine since Phabricator just observes upstream. [21:02:03] (03PS1) 10Ottomata: Puppetize SSL for Kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/394144 (https://phabricator.wikimedia.org/T166167) [21:02:27] (03CR) 10jerkins-bot: [V: 04-1] Puppetize SSL for Kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/394144 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [21:05:48] no parsoid deploy today [21:11:29] (03PS1) 10Dzahn: sca/scb: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/394151 (https://phabricator.wikimedia.org/T177225) [21:14:24] (03CR) 10Dzahn: [C: 032] sca/scb: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/394151 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [21:15:39] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban), 10User-Joe: Upgrade latest docker-registry.wikimedia.org/nodejs-devel to stretch - https://phabricator.wikimedia.org/T180524#3798234 (10Joe) p:05Triage>03High [21:17:33] RECOVERY - puppet last run on wdqs1004 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [21:18:04] (03CR) 10Chad: [C: 032] Remove www.*.org symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391355 (owner: 10Chad) [21:20:34] (03Merged) 10jenkins-bot: Remove www.*.org symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391355 (owner: 10Chad) [21:21:49] !log demon@tin Synchronized docroot/: (no justification provided) (duration: 00m 49s) [21:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:53] 10Operations, 10Ops-Access-Requests: Requesting access to 
terbium.eqiad.wmnet for cparle - https://phabricator.wikimedia.org/T181626#3798255 (10Gilles) IMHO manual access is necessary in case it doesn't work as expected, etc. It's always convenient to be able to eval things as prod mediawiki and so on when wor... [21:23:38] (03PS1) 10Dzahn: wdqs,thumbor: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/394153 (https://phabricator.wikimedia.org/T177225) [21:23:43] !log docroot sync was for If1afa59a [21:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:14] (03PS4) 10Smalyshev: wdqs: schedule cronjob to reload categories [puppet] - 10https://gerrit.wikimedia.org/r/394021 (https://phabricator.wikimedia.org/T173772) (owner: 10Gehel) [21:24:37] (03CR) 10jerkins-bot: [V: 04-1] wdqs: schedule cronjob to reload categories [puppet] - 10https://gerrit.wikimedia.org/r/394021 (https://phabricator.wikimedia.org/T173772) (owner: 10Gehel) [21:24:55] (03PS5) 10Smalyshev: wdqs: schedule cronjob to reload categories [puppet] - 10https://gerrit.wikimedia.org/r/394021 (https://phabricator.wikimedia.org/T173772) (owner: 10Gehel) [21:25:19] (03CR) 10jerkins-bot: [V: 04-1] wdqs: schedule cronjob to reload categories [puppet] - 10https://gerrit.wikimedia.org/r/394021 (https://phabricator.wikimedia.org/T173772) (owner: 10Gehel) [21:26:01] (03CR) 10Smalyshev: wdqs: schedule cronjob to reload categories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/394021 (https://phabricator.wikimedia.org/T173772) (owner: 10Gehel) [21:28:35] (03PS2) 10Ottomata: Puppetize SSL for Kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/394144 (https://phabricator.wikimedia.org/T166167) [21:29:01] (03CR) 10jerkins-bot: [V: 04-1] Puppetize SSL for Kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/394144 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [21:29:10] (03PS6) 10Smalyshev: wdqs: schedule cronjob to reload categories [puppet] - 10https://gerrit.wikimedia.org/r/394021 
(https://phabricator.wikimedia.org/T173772) (owner: 10Gehel) [21:29:33] (03CR) 10jerkins-bot: [V: 04-1] wdqs: schedule cronjob to reload categories [puppet] - 10https://gerrit.wikimedia.org/r/394021 (https://phabricator.wikimedia.org/T173772) (owner: 10Gehel) [21:31:06] (03PS3) 10Ottomata: Puppetize SSL for Kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/394144 (https://phabricator.wikimedia.org/T166167) [21:31:35] (03CR) 10jerkins-bot: [V: 04-1] Puppetize SSL for Kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/394144 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [21:31:43] (03PS7) 10Smalyshev: wdqs: schedule cronjob to reload categories [puppet] - 10https://gerrit.wikimedia.org/r/394021 (https://phabricator.wikimedia.org/T173772) (owner: 10Gehel) [21:32:34] (03PS1) 10Cooltey: Add shell accounts cooltey and sharvaniharan [puppet] - 10https://gerrit.wikimedia.org/r/394163 (https://phabricator.wikimedia.org/T173886) [21:33:14] (03CR) 10jerkins-bot: [V: 04-1] Add shell accounts cooltey and sharvaniharan [puppet] - 10https://gerrit.wikimedia.org/r/394163 (https://phabricator.wikimedia.org/T173886) (owner: 10Cooltey) [21:33:31] (03PS4) 10Ottomata: Puppetize SSL for Kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/394144 (https://phabricator.wikimedia.org/T166167) [21:33:51] (03CR) 10jenkins-bot: Remove www.*.org symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391355 (owner: 10Chad) [21:34:01] (03CR) 10jerkins-bot: [V: 04-1] Puppetize SSL for Kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/394144 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [21:34:19] (03Abandoned) 10Cooltey: Add shell accounts cooltey and sharvaniharan [puppet] - 10https://gerrit.wikimedia.org/r/394163 (https://phabricator.wikimedia.org/T173886) (owner: 10Cooltey) [21:36:23] PROBLEM - HHVM rendering on mw2187 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:37:13] RECOVERY - HHVM rendering on mw2187 is OK: 
HTTP OK: HTTP/1.1 200 OK - 75043 bytes in 0.289 second response time [21:46:15] (03CR) 10C. Scott Ananian: "oh boy oh boy oh boy!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394104 (https://phabricator.wikimedia.org/T152540) (owner: 10MaxSem) [21:46:31] cscott: scaaaary? [21:46:56] (03PS5) 10Ottomata: Puppetize SSL for Kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/394144 (https://phabricator.wikimedia.org/T166167) [21:48:01] cooltey: just abandoned because it was already done? any issues with access? [21:49:49] mutante not yet, try to push another patch for it. I got my Macbook repaired and lose all the data, so I need to update my SSH key [21:50:24] (03PS1) 10Cooltey: Update cooltey's new SSH Key [puppet] - 10https://gerrit.wikimedia.org/r/394187 [21:50:28] cooltey: ooh, ok, feel free to ping me for it, i handle the access/on duty stuff this week [21:50:31] ah [21:50:34] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10fullscreenorgId=1 [21:50:39] mutante: i have a question for you (via pm) when you can please. [21:50:58] Zppix: ok, PM me [21:50:58] Thank you! 
mutante [21:51:59] !log starting wikidata reindex (T181426) [21:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:07] T181426: Reindex wikidata to enable description index - https://phabricator.wikimedia.org/T181426 [21:52:44] (03PS4) 10Ayounsi: Bird: add monitoring to the VIP and bird process [puppet] - 10https://gerrit.wikimedia.org/r/393668 (https://phabricator.wikimedia.org/T98006) [21:53:24] (03PS7) 10Ottomata: Puppetize SSL for Kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/394144 (https://phabricator.wikimedia.org/T166167) [21:53:43] (03CR) 10Ayounsi: [C: 032] Bird: add monitoring to the VIP and bird process [puppet] - 10https://gerrit.wikimedia.org/r/393668 (https://phabricator.wikimedia.org/T98006) (owner: 10Ayounsi) [21:54:09] (03PS1) 10Tpt: Properly setup ProofreadPage namespaces for cywikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394189 (https://phabricator.wikimedia.org/T181406) [22:02:16] 10Operations, 10Scoring-platform-team, 10monitoring, 10Wikimedia-Incident: Send celery and wsgi service logs to logstash - https://phabricator.wikimedia.org/T181630#3798373 (10awight) A slightly related request--it looks like /srv/log/ores/main.log is created by modules/service/manifests/uwsgi.pp, it would... 
[22:04:03] (PS1) Chad: Move all dblists on noc to dblists/ directory, rather than individually [mediawiki-config] - https://gerrit.wikimedia.org/r/394199 [22:04:15] (PS2) Rush: bootstrapvz: add nbd-client package [puppet] - https://gerrit.wikimedia.org/r/393853 [22:05:33] (CR) Rush: [C: 2] bootstrapvz: add nbd-client package [puppet] - https://gerrit.wikimedia.org/r/393853 (owner: Rush) [22:06:20] (PS11) Rush: apt: add --force-confold/--force-confdef dpkg option to apt calls [puppet] - https://gerrit.wikimedia.org/r/392421 (https://phabricator.wikimedia.org/T180811) (owner: Arturo Borrero Gonzalez) [22:06:22] (PS1) Rush: wip: toolforge: follow attended upgrade process [puppet] - https://gerrit.wikimedia.org/r/394200 (https://phabricator.wikimedia.org/T181647) [22:06:56] (PS12) Rush: apt: add --force-confold/--force-confdef dpkg option to apt calls [puppet] - https://gerrit.wikimedia.org/r/392421 (https://phabricator.wikimedia.org/T180811) (owner: Arturo Borrero Gonzalez) [22:06:58] (CR) jerkins-bot: [V: -1] wip: toolforge: follow attended upgrade process [puppet] - https://gerrit.wikimedia.org/r/394200 (https://phabricator.wikimedia.org/T181647) (owner: Rush) [22:07:05] (PS2) Rush: wip: toolforge: follow attended upgrade process [puppet] - https://gerrit.wikimedia.org/r/394200 (https://phabricator.wikimedia.org/T181647) [22:07:31] (CR) jerkins-bot: [V: -1] wip: toolforge: follow attended upgrade process [puppet] - https://gerrit.wikimedia.org/r/394200 (https://phabricator.wikimedia.org/T181647) (owner: Rush) [22:10:20] (PS2) Dzahn: admins: Update cooltey's new SSH Key [puppet] - https://gerrit.wikimedia.org/r/394187 (owner: Cooltey) [22:10:34] (CR) Dzahn: [C: 2] "confirmed this is cooltey via login on office wiki" [puppet] - https://gerrit.wikimedia.org/r/394187 (owner: Cooltey) [22:10:53] (PS3) Dzahn: admins: Update cooltey's new SSH Key [puppet] - 
https://gerrit.wikimedia.org/r/394187 (owner: Cooltey) [22:14:13] (CR) Krinkle: Move all dblists on noc to dblists/ directory, rather than individually (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/394199 (owner: Chad) [22:16:25] (PS2) Kaldari: Enable MP3 uploads on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/393661 (https://phabricator.wikimedia.org/T120288) [22:23:14] !log Nodepool had some troubles spawning new instances from 21:09 to 21:36, and took a while to recover. Issue similar to T170492#3581822 [22:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:24] T170492: figure out if nodepool is overwhelming rabbitmq and/or nova - https://phabricator.wikimedia.org/T170492 [22:30:59] (CR) jerkins-bot: [V: -1] Move all dblists on noc to dblists/ directory, rather than individually [mediawiki-config] - https://gerrit.wikimedia.org/r/394199 (owner: Chad) [22:32:43] (PS1) Chad: Beta: Moving all docroots to standard-docroot [puppet] - https://gerrit.wikimedia.org/r/394203 [22:36:03] (CR) Dereckson: [C: 1] Properly setup ProofreadPage namespaces for cywikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/394189 (https://phabricator.wikimedia.org/T181406) (owner: Tpt) [22:43:19] (PS1) Addshore: wdbuild: add switch to ease killing [mediawiki-config] - https://gerrit.wikimedia.org/r/394207 (https://phabricator.wikimedia.org/T176948) [22:43:21] (PS1) Addshore: wdbuild: extension-list-labs stop using build entry points [mediawiki-config] - https://gerrit.wikimedia.org/r/394208 [22:43:23] (PS1) Addshore: wdbuild: Stop using wikidata build on LABS / BETA [mediawiki-config] - https://gerrit.wikimedia.org/r/394209 (https://phabricator.wikimedia.org/T176948) [22:43:25] (PS1) Addshore: wdbuild: Stop loading from build on test and testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/394210 
(https://phabricator.wikimedia.org/T176948) [22:43:27] (PS1) Addshore: wdbuild: Stop loading from build on group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/394211 (https://phabricator.wikimedia.org/T176948) [22:43:29] (PS1) Addshore: wdbuild: Stop loading from build on group1 [mediawiki-config] - https://gerrit.wikimedia.org/r/394212 (https://phabricator.wikimedia.org/T176948) [22:43:41] (PS1) Addshore: wdbuild: Stop loading from build on all wikis (except enwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/394213 (https://phabricator.wikimedia.org/T176948) [22:43:43] (PS1) Addshore: wdbuild: Stop loading from build on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/394214 (https://phabricator.wikimedia.org/T176948) [22:43:45] (PS1) Addshore: wdbuild: Remove wmgUseWikidataBuild [mediawiki-config] - https://gerrit.wikimedia.org/r/394215 [22:43:47] (PS1) Addshore: wdbuild: Remove Wikibase-buildentry.php config file (empty) [mediawiki-config] - https://gerrit.wikimedia.org/r/394216 [22:43:58] no_justification: ^^ that's what I'm gonna do on monday :) [22:44:12] aude: ^^ [22:46:33] (CR) Krinkle: "Needs an update to the testNocDblists unit test." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/394199 (owner: Chad) [22:46:38] everyone throw beer at addshore <3 [22:46:42] (CR) jerkins-bot: [V: -1] wdbuild: extension-list-labs stop using build entry points [mediawiki-config] - https://gerrit.wikimedia.org/r/394208 (owner: Addshore) [22:46:44] (CR) jerkins-bot: [V: -1] wdbuild: add switch to ease killing [mediawiki-config] - https://gerrit.wikimedia.org/r/394207 (https://phabricator.wikimedia.org/T176948) (owner: Addshore) [22:46:47] PROBLEM - Disk space on graphite1003 is CRITICAL: DISK CRITICAL - free space: /var/lib/carbon 84802 MB (3% inode=97%) [22:46:48] bwhahahahahaaaaaa [22:47:45] (CR) jerkins-bot: [V: -1] wdbuild: Stop using wikidata build on LABS / BETA [mediawiki-config] - https://gerrit.wikimedia.org/r/394209 (https://phabricator.wikimedia.org/T176948) (owner: Addshore) [22:47:47] (CR) jerkins-bot: [V: -1] wdbuild: Stop loading from build on test and testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/394210 (https://phabricator.wikimedia.org/T176948) (owner: Addshore) [22:47:48] phpcs fail, hahaa [22:47:53] * Zppix throws a broken beer bottle at addshore [22:47:53] * addshore will fix that tomorrow [22:47:55] (CR) jerkins-bot: [V: -1] wdbuild: Stop loading from build on group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/394211 (https://phabricator.wikimedia.org/T176948) (owner: Addshore) [22:48:44] (CR) Krinkle: [C: -1] "wikimedia.org and wikipedia.org are not symlinks to standard-docroot. All others are, though." 
[puppet] - https://gerrit.wikimedia.org/r/394203 (owner: Chad) [22:48:46] (CR) jerkins-bot: [V: -1] wdbuild: Stop loading from build on group1 [mediawiki-config] - https://gerrit.wikimedia.org/r/394212 (https://phabricator.wikimedia.org/T176948) (owner: Addshore) [22:48:59] * addshore goes to bed [22:50:25] (CR) jerkins-bot: [V: -1] wdbuild: Remove wmgUseWikidataBuild [mediawiki-config] - https://gerrit.wikimedia.org/r/394215 (owner: Addshore) [22:50:27] (CR) jerkins-bot: [V: -1] wdbuild: Stop loading from build on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/394214 (https://phabricator.wikimedia.org/T176948) (owner: Addshore) [22:50:30] (CR) jerkins-bot: [V: -1] wdbuild: Stop loading from build on all wikis (except enwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/394213 (https://phabricator.wikimedia.org/T176948) (owner: Addshore) [22:58:33] Operations, Gerrit, ORES, Scoring-platform-team, and 3 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798554 (Halfak) Trying start a gerrit review for wheels. Got this: ``` Do you really want to submit the above commits? Type 'yes' to confirm, other... [22:59:47] Operations, Gerrit, ORES, Scoring-platform-team, and 3 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798556 (Halfak) Putting repo backups here: https://analytics.wikimedia.org/datasets/archive/public-datasets/all/ores/ I'm editquality and draftquali... [23:17:56] Operations, Gerrit, ORES, Scoring-platform-team, and 3 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3798581 (Halfak) https://github.com/wiki-ai/draftquality is fully updated. 
[23:35:05] (PS1) Madhuvishy: public_dumps: Revert inclusion of labstore::init class [puppet] - https://gerrit.wikimedia.org/r/394224 (https://phabricator.wikimedia.org/T181431) [23:37:02] (CR) Madhuvishy: [C: 2] public_dumps: Revert inclusion of labstore::init class [puppet] - https://gerrit.wikimedia.org/r/394224 (https://phabricator.wikimedia.org/T181431) (owner: Madhuvishy) [23:43:36] PROBLEM - puppet last run on labstore1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:47:21] ^ looking [23:48:37] RECOVERY - puppet last run on labstore1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:48:59] (PS2) Dzahn: wdqs,thumbor: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394153 (https://phabricator.wikimedia.org/T177225) [23:50:00] (CR) Dzahn: [C: 2] wdqs,thumbor: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394153 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn) [23:53:47] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 5 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[ops_ensure_members],Exec[deployment_ensure_members],Exec[absent_ensure_members],Exec[snapshot-admins_ensure_members] [23:54:34] eh, odd, looking at the snapshot one [23:55:36] hmmm [23:56:00] mutante: let me know what you see [23:56:45] madhuvishy: nothing about snapshot-admins when running puppet now [23:57:22] what is the error? [23:57:58] there is no error.. now or when i run it. nothing [23:58:06] Oh i see [23:58:47] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:59:11] there were no errors when it complained with icinga-wm> IRC echo bot PROBLEM - puppet last run on labstore1007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues either [23:59:36] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [23:59:41] hmm.. but we did have changes to puppetmaster