[00:00:03] 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2891214 (10fgiunchedi) I'm trying to reproduce the problem on mw1189, I couldn't find the exact request that parsoid is making to `api... [00:10:27] godog: a second puppet run then installs a bunch of packages, such as fonts and the error is gone it looks [00:10:51] so not persisting, but still not done, will let you know if otherwise [00:11:32] ah I think I know what it is, the exporter isn't creating the 'prometheus' user would be my guess [00:12:01] but then e.g. prometheus-node-exporter does and then it works [00:23:30] (03CR) 10Volans: [C: 04-1] "A couple of missing things and other comments inline." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/328194 (https://phabricator.wikimedia.org/T153488) (owner: 10Giuseppe Lavagetto) [00:25:09] (03CR) 10Alex Monk: "You don't need this on all wikis with PageAssssments installed?" 
[puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) (owner: 10Kaldari) [00:25:38] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [00:27:28] PROBLEM - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.37.21:9200/_cluster/health error while fetching: (Connection aborted., error(111, Connection refused)) [00:27:48] PROBLEM - ElasticSearch health check for shards on relforge1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 180 threshold =0.1% breach: status: red, number_of_nodes: 1, unassigned_shards: 180, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 180, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 50 [00:28:18] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.242 second response time [00:30:28] RECOVERY - ElasticSearch health check for shards on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: yellow, number_of_nodes: 2, unassigned_shards: 20, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 275, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 93.9058171745, active_shards: 33 [00:30:48] RECOVERY - ElasticSearch health check for shards on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 275, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 361, initial [00:31:46] (03CR) 10Kaldari: "@Alex Monk: Right now, 
it's only being actively used on English Wikipedia. The other 2 wikis it was activated on were mostly for testing p" [puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) (owner: 10Kaldari) [00:32:31] (03CR) 10Alex Monk: "that's a common one I think. The other would be a dblist" [puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) (owner: 10Kaldari) [00:39:40] 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2891675 (10fgiunchedi) after chatting with @Pchelolo I've diffed the conftool pools for `api.svc.eqiad.wmnet` and https has servers d... [00:45:04] 06Operations, 06Labs, 10Labs-Infrastructure, 10Monitoring: Have a paging check for Nova API accessible - https://phabricator.wikimedia.org/T133656#2891683 (10AlexMonk-WMF) Basic check added in T42022. My question above should be pretty easy though. [00:49:28] PROBLEM - puppet last run on mw1279 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:52:28] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:54:17] (03Abandoned) 10Paladox: Contint: Downgrade Chromium to 53.0.2785 [puppet] - 10https://gerrit.wikimedia.org/r/328217 (https://phabricator.wikimedia.org/T153597) (owner: 10Paladox) [01:03:07] 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2891700 (10fgiunchedi) The Parsoid dashboard shows non-200 codes, I was looking for total 200s but couldn't find it yet https://grafan... 
[01:04:56] 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2891701 (10Pchelolo) You can find this info at https://grafana.wikimedia.org/dashboard/db/restbase?panelId=13&fullscreen The rate of... [01:09:46] !log mw1168 - remove old salt key, accept new salt key, start minion [01:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:14] !log dzahn@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1168.eqiad.wmnet [01:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:18:28] RECOVERY - puppet last run on mw1279 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [01:21:28] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [01:23:06] (03PS1) 10Dzahn: dhcp: switch private-c-eqiad to install1001 as tftp [puppet] - 10https://gerrit.wikimedia.org/r/328448 (https://phabricator.wikimedia.org/T132757) [01:23:36] (03PS3) 10Filippo Giunchedi: prometheus: add aggregation rules for MySQL [puppet] - 10https://gerrit.wikimedia.org/r/328425 [01:24:22] (03PS2) 10Dzahn: dhcp: switch private-c-eqiad to install1001 as tftp [puppet] - 10https://gerrit.wikimedia.org/r/328448 (https://phabricator.wikimedia.org/T132757) [01:24:43] (03CR) 10Dzahn: [C: 032] dhcp: switch private-c-eqiad to install1001 as tftp [puppet] - 10https://gerrit.wikimedia.org/r/328448 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [01:25:43] (03CR) 10Dzahn: [V: 032 C: 032] dhcp: switch private-c-eqiad to install1001 as tftp [puppet] - 10https://gerrit.wikimedia.org/r/328448 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [01:29:42] 06Operations, 06Labs: openstackclient/keystoneclient on silver broken - https://phabricator.wikimedia.org/T153807#2891728 (10AlexMonk-WMF) [01:39:42] 
(03CR) 10Filippo Giunchedi: [C: 032] prometheus: add aggregation rules for MySQL [puppet] - 10https://gerrit.wikimedia.org/r/328425 (owner: 10Filippo Giunchedi) [01:39:50] (03PS4) 10Filippo Giunchedi: prometheus: add aggregation rules for MySQL [puppet] - 10https://gerrit.wikimedia.org/r/328425 [01:43:59] (03PS3) 10VolkerE: Make notification logos high-density [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319968 (https://phabricator.wikimedia.org/T147219) (owner: 10Catrope) [01:45:11] (03CR) 10Dzahn: "so, about the existing instances that use this role. If we'd just reconfigure them and add the new role to the existing setup i guess the " [puppet] - 10https://gerrit.wikimedia.org/r/327690 (https://phabricator.wikimedia.org/T139475) (owner: 10Dzahn) [01:47:07] !log mw1169 - schedule 2 hours downtime - boot for reinstall shortly [01:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:47:57] !log dzahn@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1169.eqiad.wmnet [01:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:56] (03CR) 10VolkerE: "Updated Wikidata logo with manually improved version, also reduced file sizes of all icons from 79 KB to 43 KB with help of TinyPNG." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319968 (https://phabricator.wikimedia.org/T147219) (owner: 10Catrope) [01:49:36] (03CR) 10VolkerE: "Foundation logo discussion is going on in other patch set, therefore no blocker from my POV any more on this." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/319968 (https://phabricator.wikimedia.org/T147219) (owner: 10Catrope) [01:49:37] !log carbon - temp stop DHCP service to test install from install1001 [01:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:51:28] PROBLEM - Disk space on relforge1001 is CRITICAL: DISK CRITICAL - free space: / 14087 MB (15% inode=97%) [01:53:04] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2891779 (10Pokefan95) [01:53:10] 15% is a lot left to call it CRIT [01:54:03] have to focus on the install so ignoring relforge1001 [01:54:18] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2882187 (10Pokefan95) [02:02:22] !log re-enabling DHCP and puppet [02:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:02:29] on carbon [02:06:13] !log reinstalling mw1169 (carbon DHCP, install1001 TFTP) [02:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:45] sup Samantha [02:07:59] (03PS2) 10Dzahn: install/dhcp: switch all "next-server" from carbon to install1001 [puppet] - 10https://gerrit.wikimedia.org/r/328439 (https://phabricator.wikimedia.org/T132757) [02:08:39] hello! [02:09:28] hi [02:10:28] so replacing the DHCP part of carbon didnt work yet [02:10:35] i need ACL changes on network gear [02:10:44] or at least i'm pretty sure that's it [02:11:12] something to continue on tomorrow.. but still.. 
carbon much closer to retiring [02:11:35] the TFTP part is done by install1001 just fine and they all have the same roles since today [02:12:52] !log mw1169 - delete salt key, revoke puppet cert [02:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:20:57] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.6) (duration: 07m 40s) [02:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:23:32] !log mw1169 - reinstall done - sign new puppet cert, initial run... [02:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:51:04] !log relforge1001 has huge /var/log/elastichsearch/relforge-eqiad_feature.log that wrote GBs just today but then stopped [02:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:51:46] ebernhardson: ^ do you know that file ? [02:52:00] relforge-eqiad_feature.log [02:57:18] i see a "root" is watching that file with "pv" so somebody is aware.. good enough [03:01:38] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [03:02:07] (03PS1) 10Dzahn: move ganglia aggregator eqiad from carbon to install1001 [puppet] - 10https://gerrit.wikimedia.org/r/328450 (https://phabricator.wikimedia.org/T132757) [03:03:05] !log dzahn@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1169.eqiad.wmnet [03:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:03:56] afk [03:05:25] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2882187 (10Dzahn) mw1168 and mw1169 have been reinstalled with the new partman recipe and now have more space i... 
[03:06:17] (03PS1) 10Tim Landscheidt: Tools: Fully qualify hostnames [puppet] - 10https://gerrit.wikimedia.org/r/328451 (https://phabricator.wikimedia.org/T153608) [03:16:59] (03PS1) 10Tim Landscheidt: Tools: Remove redundant tools-db entry from /etc/hosts [puppet] - 10https://gerrit.wikimedia.org/r/328453 (https://phabricator.wikimedia.org/T139190) [03:29:38] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [03:31:16] (03PS1) 10Tim Landscheidt: deployment-prep: Fully qualify hostnames [puppet] - 10https://gerrit.wikimedia.org/r/328455 (https://phabricator.wikimedia.org/T153608) [03:34:18] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [03:35:23] (03PS1) 10Tim Landscheidt: staging: Fully qualify hostname [puppet] - 10https://gerrit.wikimedia.org/r/328456 (https://phabricator.wikimedia.org/T153608) [03:38:10] (03PS1) 10Tim Landscheidt: trebuchet: Fully qualify hostname [puppet] - 10https://gerrit.wikimedia.org/r/328457 (https://phabricator.wikimedia.org/T153608) [04:03:28] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [04:34:28] PROBLEM - puppet last run on labnodepool1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:36:36] !log commtech Added samwilson as project admin [04:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:39:48] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [04:48:45] (03PS19) 10Yuvipanda: labs: maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 [04:51:54] (03CR) 10Yuvipanda: [C: 032] labs: maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 (owner: 10Yuvipanda) [04:55:25] (03PS1) 10Yuvipanda: labs: Fix service unit for maintani-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/328459 [04:56:18] PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:57:52] (03CR) 10Yuvipanda: [C: 032] labs: Fix service unit for maintani-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/328459 (owner: 10Yuvipanda) [04:59:15] RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational [04:59:25] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active [05:02:25] RECOVERY - puppet last run on labnodepool1001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [05:07:55] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [05:11:13] (03PS1) 10Yuvipanda: labsdbs: Fixup delete-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/328460 [05:11:42] https://meta.wikimedia.org/wiki/Special:GlobalUserRights gives me [WFoPAgpAMFMAAAzgMyEAAABC] 2016-12-21 05:11:30: Fatal exception of type MWException [05:11:55] (03CR) 10Yuvipanda: [V: 032 C: 032] labsdbs: Fixup delete-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/328460 (owner: 10Yuvipanda) [05:12:15] PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[05:12:25] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [05:13:15] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [05:15:15] RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational [05:15:25] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active [05:29:39] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [05:30:40] (am playing with ^ now) [05:30:56] hmm I need to have it shut up on 1005 [05:39:19] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [05:39:39] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active [05:49:19] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=4858.80 Read Requests/Sec=468.60 Write Requests/Sec=0.90 KBytes Read/Sec=32574.40 KBytes_Written/Sec=28.80 [05:57:29] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=46%) [06:01:19] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=24.70 Read Requests/Sec=7.30 Write Requests/Sec=2.60 KBytes Read/Sec=29.60 KBytes_Written/Sec=76.40 [06:12:19] 06Operations, 10Annual-Report: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#2892108 (10Aklapper) >>! In T151798#2828771, @Dzahn wrote: > https://annual.wikimedia.org/2016/ > Does this resolve the ticket to create the URL? > Or did you want to keep it open until the real cont... 
[06:26:12] 06Operations, 10Annual-Report: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#2892140 (10Dzahn) a:03ZMcCune @ZMcCune let me know if you have questions about uploading the content to gerrit. i can help with that if needed [06:29:28] 06Operations, 10OCG-General, 06Wiktionary, 13Patch-For-Review: Download as PDF does not work in English Wiktionary: "There was an error while attempting to render your book." - https://phabricator.wikimedia.org/T150604#2892150 (10Aklapper) 05Open>03Resolved a:03Aklapper All patches merged, hence assu... [06:30:01] (03PS1) 10Dzahn: planet: remove wikimedia.org.au feed [puppet] - 10https://gerrit.wikimedia.org/r/328465 (https://phabricator.wikimedia.org/T133620) [06:30:51] 06Operations, 10IDS-extension, 10Wikimedia-Extension-setup, 07I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2892160 (10Samwilson) [06:36:52] (03PS2) 10Dzahn: planet: remove wikimedia.org.au feed [puppet] - 10https://gerrit.wikimedia.org/r/328465 (https://phabricator.wikimedia.org/T133620) [06:37:20] (03PS3) 10Dzahn: planet: remove wikimedia.org.au feed [puppet] - 10https://gerrit.wikimedia.org/r/328465 (https://phabricator.wikimedia.org/T133620) [06:38:44] (03CR) 10Dzahn: [C: 032] planet: remove wikimedia.org.au feed [puppet] - 10https://gerrit.wikimedia.org/r/328465 (https://phabricator.wikimedia.org/T133620) (owner: 10Dzahn) [06:43:37] (03PS1) 10Tim Landscheidt: apache: Fix some issues with apache::static_site [puppet] - 10https://gerrit.wikimedia.org/r/328466 (https://phabricator.wikimedia.org/T153816) [06:43:39] (03PS1) 10Tim Landscheidt: [WIP] aptly: Make aptly work with Apache [puppet] - 10https://gerrit.wikimedia.org/r/328467 (https://phabricator.wikimedia.org/T153814) [06:44:35] (03CR) 10jerkins-bot: [V: 04-1] apache: Fix some issues with apache::static_site [puppet] - 10https://gerrit.wikimedia.org/r/328466 (https://phabricator.wikimedia.org/T153816) (owner: 10Tim 
Landscheidt) [06:46:59] (03PS2) 10Tim Landscheidt: apache: Fix some issues with apache::static_site [puppet] - 10https://gerrit.wikimedia.org/r/328466 (https://phabricator.wikimedia.org/T153816) [06:47:01] (03PS2) 10Tim Landscheidt: [WIP] aptly: Make aptly work with Apache [puppet] - 10https://gerrit.wikimedia.org/r/328467 (https://phabricator.wikimedia.org/T153814) [06:47:57] (03CR) 10jerkins-bot: [V: 04-1] apache: Fix some issues with apache::static_site [puppet] - 10https://gerrit.wikimedia.org/r/328466 (https://phabricator.wikimedia.org/T153816) (owner: 10Tim Landscheidt) [06:50:29] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [06:55:03] (03CR) 10Tim Landscheidt: [C: 04-1] "Missed the possibilities of array *and* string for ldap_groups, and need to align arrows." [puppet] - 10https://gerrit.wikimedia.org/r/328466 (https://phabricator.wikimedia.org/T153816) (owner: 10Tim Landscheidt) [06:56:29] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[vim] [07:18:39] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:39] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:39] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:39] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:39] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:40] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:40] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:18:41] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:41] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:42] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:42] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:42] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:49] ^ checking [07:19:29] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [07:19:29] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [07:19:29] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [07:19:29] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [07:19:29] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [07:19:30] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [07:19:30] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [07:19:31] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [07:19:31] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [07:19:32] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [07:19:32] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [07:19:32] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [07:20:23] It took ages for me to run show slave 's1' status as the server 
was a bit loaded, so I guess that is why it alerted. There are several backups being done at the moment so the server is a bit loaded [07:20:42] !log Running optimize table on db1045 for the revision tables as we urgently need some space back on that host - https://phabricator.wikimedia.org/T153739 [07:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:29] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [07:46:56] 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2892280 (10Joe) >>! In T153797#2891580, @Pchelolo wrote: > Another interesting type of 404s from parsoid started to appear after move... [07:56:31] (03PS1) 10Elukey: Increase hhvm threads and transcode capabilities on mw116[89] [puppet] - 10https://gerrit.wikimedia.org/r/328473 (https://phabricator.wikimedia.org/T153488) [07:59:25] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/4957/ looks good" [puppet] - 10https://gerrit.wikimedia.org/r/328473 (https://phabricator.wikimedia.org/T153488) (owner: 10Elukey) [08:01:40] !log Running optimize table on db1044 for the pagelinks tables as we urgently need some space back on that host - T153826 [08:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:44] T153826: Defragment db1044 - https://phabricator.wikimedia.org/T153826 [08:26:54] Revent: o/ - mw116[89] should handle a bit more load now, let's see if the queue improves [08:27:03] Yay. [08:27:27] https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=mw1169 [08:28:42] elukey: Something that was done, at some point, dumped a ton of files (thousands) out as having a “transcode_time_addjob”, a “transcode_time_error”, and no “transcode_time_startwork”…. 
[08:29:32] The system appears to see them as queued, and runs them, but they end up with a ‘negative’ encoding time (they have a success time, an error time, and no start time) [08:29:54] Revent: do you mean during the past couple of days or way before? [08:30:01] Recently. [08:30:33] That same DB query had previously only shown files that I had ‘intentionally’ broken by resetting them while running. [08:31:03] maybe that was me restarting the jobrunners [08:31:23] I’ve been going through the list, and resetting them to make them have a consistent state in the DB…. it doesn’t ‘add them’ to the queue (they are already queued) but makes them end up with the correct status. [08:31:43] Yeah, that’s what I suspect, that they were ‘supposedly’ started by the bug, but not really. [08:32:26] Anyhow, I’ve been fixing them… I rather suspect that these are the files that are being ‘run’ but not showing up as running in the timedmediahandler special page. [08:33:46] so I am seeing Apache fcgi errors now, and 503s logged [08:33:47] sigh [08:35:02] It’s worth noting that the ‘thousands’ I’m referring to all failed at “Error on 12:49, 2016 December 19” or a minute later. [08:37:39] 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2892463 (10Joe) So, about the unconfigured domain ones: it seems that parsoid sometimes sends out a request without the `Host:` heade... [08:42:49] !log restarted hhvm/jobrunner (and killed ffmpeg processes) on mw116[89] [08:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:09] Revent: --^ this might trigger errors, but let's see if it helps.. [08:44:40] 07Puppet, 13Patch-For-Review: apache::static_site is not working - https://phabricator.wikimedia.org/T153816#2892485 (10Peachey88) [08:45:23] elukey: I’m vaguely guessing that errors won’t actually show up until the timeout expires... 
[08:58:59] ok so I am live hacking on mw1168 to figure out why hhvm truncates connections with httpd [08:59:48] (03PS1) 10Muehlenhoff: Also follow stat1001 rename in debdeploy grains [puppet] - 10https://gerrit.wikimedia.org/r/328478 [09:02:52] (03PS5) 10Muehlenhoff: Make systemd-timesyncd available as an alternative time synchronisation provider (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/322279 (https://phabricator.wikimedia.org/T150257) [09:04:55] I found the issue, sigh [09:07:52] (03PS1) 10Elukey: Add hhvm timeouts overrides to hiera for mw116[89] [puppet] - 10https://gerrit.wikimedia.org/r/328479 (https://phabricator.wikimedia.org/T153488) [09:10:32] (03CR) 10Elukey: [C: 032] Add hhvm timeouts overrides to hiera for mw116[89] [puppet] - 10https://gerrit.wikimedia.org/r/328479 (https://phabricator.wikimedia.org/T153488) (owner: 10Elukey) [09:11:15] 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2892522 (10Joe) After further inspection of the logs, it seems that those are the only 404 errors we get, apart from a few when trying... [09:12:18] elukey: Greatly appreciate you guys working on this… it obviously needed the love, lol. [09:14:51] Revent: same thing for your work :) [09:14:59] (I mean appreciated!) [09:15:23] * elukey tails logs on mw116[89] waiting for good news [09:15:36] <_joe_> elukey: what's the issue? 
[09:15:46] <_joe_> oh I see [09:15:47] <_joe_> sigh [09:19:48] _joe_ it was an issue between me and puppet, that thing always makes fun of me [09:21:37] on the bright side, load is still good so we might think of raising the runners a bit more later on [09:21:44] <_joe_> I' [09:22:02] <_joe_> I would suggest everyone to let the system work for a few days without interventions [09:22:12] <_joe_> or we won't be able to understand what's happening [09:22:23] <_joe_> and we need to ping the developers too, elukey [09:22:56] _joe_ yes you are right :) [09:27:48] _joe_: Hey. [09:28:15] <_joe_> Revent: as you might guess from the tickets comments, I'm working on something else today :/ [09:28:36] Just to be clear… there are ‘lots’ of queued tasks that have a ‘messed up’ status… they are queued, but have an error time in the DB. [09:29:16] If run without being reset first, they end up with a ‘weird’ status (both an error time and a success time) [09:29:46] Hopefully, I can still work on resetting those (before they are run) so that when run they’ll end up with the right status. [09:30:20] I’m not resetting anything that’s actually ‘running’ anymore. [09:30:29] PROBLEM - puppet last run on analytics1026 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:41:21] (03CR) 10Alexandros Kosiaris: [C: 032] package_builder: rebuild Packages only when needed [puppet] - 10https://gerrit.wikimedia.org/r/328221 (owner: 10Filippo Giunchedi) [09:41:26] (03PS2) 10Alexandros Kosiaris: package_builder: rebuild Packages only when needed [puppet] - 10https://gerrit.wikimedia.org/r/328221 (owner: 10Filippo Giunchedi) [09:41:29] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] package_builder: rebuild Packages only when needed [puppet] - 10https://gerrit.wikimedia.org/r/328221 (owner: 10Filippo Giunchedi) [09:46:33] 06Operations, 10media-storage, 13Patch-For-Review: cronspam cleanup: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.monthly ) - https://phabricator.wikimedia.org/T152440#2892599 (10MoritzMuehlenhoff) A fix for this is pending with the next jessie point release, which... [09:58:14] (03PS1) 10Reedy: 3 more to extension.json in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328482 (https://phabricator.wikimedia.org/T139800) [09:58:29] RECOVERY - puppet last run on analytics1026 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [09:58:33] !log extending db1035 /srv partition [09:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:18] !log installing libgme security updates [10:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:04] (03PS1) 10Reedy: Use wfLoadExtension for 3 more extensions too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328484 (https://phabricator.wikimedia.org/T140852) [10:35:39] (03PS2) 10Marostegui: misc.my.cnf.erb: Enable barracuda and innodb_strict_mode [puppet] - 10https://gerrit.wikimedia.org/r/321638 (https://phabricator.wikimedia.org/T150949) [10:43:19] !log dropping non-wiki databases from labsdb1001 [10:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:49:26] 06Operations, 10TimedMediaHandler, 10hardware-requests: Assign 3 more servers to video scaler duty - https://phabricator.wikimedia.org/T114337#1691895 (10elukey) In T153488 we repurposed two jobrunners to videoscalers (mw116[89]), so now the total eqiad cluster is 4. We spent a bit of time solving an apache<... [10:54:39] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:57:24] (03PS2) 10Elukey: Addin Eric Evans to analytics-privatedata for stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/328181 (https://phabricator.wikimedia.org/T153375) (owner: 10Cmjohnson) [10:58:58] (03PS3) 10Elukey: Add Eric Evans to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/328181 (https://phabricator.wikimedia.org/T153375) (owner: 10Cmjohnson) [11:00:11] (03CR) 10Elukey: [C: 032] Add Eric Evans to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/328181 (https://phabricator.wikimedia.org/T153375) (owner: 10Cmjohnson) [11:06:02] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1002.eqiad.wmnet for eevans - https://phabricator.wikimedia.org/T153375#2892947 (10elukey) 05Open>03Resolved a:03elukey Ran puppet on stat100[24], @Eevans you can now ssh and your username is in the `analytics-privatedata... 
[11:22:39] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [11:37:39] PROBLEM - OTRS SMTP on mendelevium is CRITICAL: connect to address 10.64.32.174 and port 25: Connection refused [11:41:49] PROBLEM - spamassassin on mendelevium is CRITICAL: PROCS CRITICAL: 0 processes with args spamd [11:48:16] (03PS1) 10Giuseppe Lavagetto: tlsproxy::localssl: add ability to have an access.log [puppet] - 10https://gerrit.wikimedia.org/r/328495 (https://phabricator.wikimedia.org/T153797) [11:49:49] RECOVERY - spamassassin on mendelevium is OK: PROCS OK: 1 process with args spamd [11:50:39] RECOVERY - OTRS SMTP on mendelevium is OK: SMTP OK - 0.003 sec. response time [11:53:23] 06Operations, 10Parsoid, 13Patch-For-Review, 06Services (doing), 15User-mobrovac: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2893086 (10mobrovac) p:05Triage>03High [11:59:37] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2893089 (10Esc3300) I think it would be good to see some metrics. How many people use these links? Is this already available on grafana? [12:13:57] elukey: Still around? [12:15:38] Or _joe_ really, just want to comment. [12:18:08] (03PS1) 10Muehlenhoff: Add retroactively assigned CVE ID [debs/linux44] - 10https://gerrit.wikimedia.org/r/328498 [12:18:26] (03CR) 10Muehlenhoff: [C: 032] Add retroactively assigned CVE ID [debs/linux44] - 10https://gerrit.wikimedia.org/r/328498 (owner: 10Muehlenhoff) [12:19:16] Meh, you’ll likely see it later. The Special page is showing about 90 transcodes running at a time, which is an appropriate number for the CPU count. There do not seem to be tasks showing up as ‘failed’, and they are completing. 
[12:21:16] I’m methodically ‘resetting’ transcodes shown on the ‘special page’ as queued, but that show on the file page as ‘error’… these all have a couple of specific times shown for the error that seem to directly connect to server resets… resetting them is just so they end up with a ‘sane’ status when run, instead of negative times or other oddness. [12:22:11] There are still a ton of the ‘messed up’ ones, but the rate of them completing with the ‘broken’ status (both a success and an error time) seems to be going down. [12:23:24] (resetting these does not make the count on the special page of ‘queued’ OR ‘failed’ transcodes change… they just have a broken state in the DB) [12:23:30] * elukey reads [12:25:15] Revent: is there a phab task to track this behavior? If not, it would be really great if you could (whenever you have time) open one and add me in CC [12:25:22] I think the issue with ‘running’ transcodes that are not shown as ‘running’ on the special page is just due to ones with that ‘broken’ status. [12:25:39] Not specifically for this, I don’t think. [12:26:20] no I meant for the ones that you are resetting.. [12:26:33] Ah, nope. [12:26:37] you shouldn't have to do it, it feels wrong :D [12:26:46] Yeah, I know... [12:26:59] maybe it is a bug that we can solve easily [12:27:06] we == ops and devs [12:27:18] To make it clear, I had written this query... [12:27:19] or anybody that has experience with the PHP code [12:27:31] sure sure :) [12:27:46] this is the current status now from terbium [12:27:47] webVideoTranscode: 15434 queued; 805 claimed (199 active, 606 abandoned); 0 delayed [12:27:59] https://quarry.wmflabs.org/query/14861 <- the description is related to what I was originally using it for, it showed the ones that I had kicked off the queue by resetting them while running. [12:28:26] It was around a hundred or two. [12:28:48] When you reset the servers, after fixing the timeout, it dumped about 5k on there. 
[12:29:07] :( [12:29:21] definitely we need a phab task to see if it is a bug or a feature :D [12:29:42] I am going afk for a bit, but feel free to write / ping me, I'll read later on! [12:29:52] I think what needs to be fixed is that when a transcode completes successfully, it should explicitly clear any previously existing transcode_error and transcode_time_error [12:30:39] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:30:53] But… I don’t think it’s something that would occur when the system was running normally. [12:31:42] (it’s not something I ever saw before my ‘reset while running’, or your server reset) [12:33:19] The error times also seem to be tightly clustered around 12:40 and 13:30 on the 19th, which I believe is when you reset the two servers. [12:34:10] If the system was, due to the insane timeout, trying to start thousands of tasks that never actually ‘started’ because the servers were out of RAM, this would make sense. [12:40:39] 06Operations, 06Commons, 10TimedMediaHandler, 10Wikimedia-Video: Creation of derivative video files is not working - https://phabricator.wikimedia.org/T153852#2893187 (10Pokefan95) [12:42:09] (03PS1) 10Urbanecm: Enable SandboxLink on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328502 (https://phabricator.wikimedia.org/T153855) [12:44:50] (03PS1) 10Muehlenhoff: Another retroactive CVE assignment [debs/linux44] - 10https://gerrit.wikimedia.org/r/328503 [12:45:23] 06Operations, 06Commons, 10TimedMediaHandler, 10Wikimedia-Video: Creation of derivative video files is not working - https://phabricator.wikimedia.org/T153852#2893074 (10Revent) This is known, and being worked. It's simply that the backlog became extremely large due to a high number of 'huge' (1920P, and a... 
[12:46:52] !log install openjdk-6 security update on labsdb1006 [12:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:22] (03CR) 10Muehlenhoff: [C: 032] Another retroactive CVE assignment [debs/linux44] - 10https://gerrit.wikimedia.org/r/328503 (owner: 10Muehlenhoff) [12:56:14] 06Operations, 06Commons, 10TimedMediaHandler, 10Wikimedia-Video: Creation of derivative video files is not working - https://phabricator.wikimedia.org/T153852#2893223 (10Pristurus) Okay, thank you very much for this information. [12:57:28] !log mobrovac@tin Starting deploy [parsoid/deploy@dab1f27]: Bug fix for mwApiServer T153797 [12:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:33] T153797: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797 [12:58:05] (03PS1) 10Muehlenhoff: Fix CVE ID for exception table privilege escalation [debs/linux44] - 10https://gerrit.wikimedia.org/r/328507 [12:59:39] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [13:03:01] !log mobrovac@tin Finished deploy [parsoid/deploy@dab1f27]: Bug fix for mwApiServer T153797 (duration: 05m 32s) [13:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:04] T153797: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797 [13:07:13] (03PS1) 10Alexandros Kosiaris: Introduce dbmonitor1001, dbmonitor2001 [puppet] - 10https://gerrit.wikimedia.org/r/328509 (https://phabricator.wikimedia.org/T149557) [13:08:08] (03PS1) 10Muehlenhoff: Update to 4.4.33 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328510 [13:09:33] !log install hdf5 security updates [13:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:10] (03CR) 10Muehlenhoff: [C: 032] Fix CVE 
ID for exception table privilege escalation [debs/linux44] - 10https://gerrit.wikimedia.org/r/328507 (owner: 10Muehlenhoff) [13:22:09] 06Operations, 10Parsoid, 13Patch-For-Review, 06Services (doing), 15User-mobrovac: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2893280 (10Joe) Errors have completely stopped after @mobrovac's patch has been added. [13:32:29] PROBLEM - Disk space on db1035 is CRITICAL: DISK CRITICAL - free space: /srv 63672 MB (3% inode=99%) [13:34:29] that is me [13:34:38] it should reach 1%, then go down [13:36:27] it is depooled in any case [13:36:30] 06Operations, 10Ops-Access-Requests, 10Reading-Web-Trending-Service, 06Services (done), 15User-mobrovac: Allow @Jdlrobson and @bearND to deploy and manage the trending edits service - https://phabricator.wikimedia.org/T153458#2893302 (10mobrovac) [13:39:49] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:40:54] 06Operations, 10Parsoid, 06Services (done), 15User-mobrovac: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2893307 (10mobrovac) 05Open>03Resolved No //batch request failure// errors for the last 40 minutes, so I... 
[13:46:57] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/4959/" [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) (owner: 10Marostegui) [13:47:53] (03PS3) 10Ema: varnish cachestats.py: add support for defaults and key_prefix [puppet] - 10https://gerrit.wikimedia.org/r/328368 (https://phabricator.wikimedia.org/T151643) [13:48:05] (03CR) 10Ema: [V: 032 C: 032] varnish cachestats.py: add support for defaults and key_prefix [puppet] - 10https://gerrit.wikimedia.org/r/328368 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [13:49:29] PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:50:49] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [13:54:17] (03PS2) 10Muehlenhoff: Update to 4.4.33 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328510 [13:55:20] (03CR) 10Jcrespo: [WIP] Reporting tests with the private data script (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) (owner: 10Marostegui) [13:57:22] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.33 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328510 (owner: 10Muehlenhoff) [13:57:43] (03PS2) 10Muehlenhoff: Also follow stat1001 rename in debdeploy grains [puppet] - 10https://gerrit.wikimedia.org/r/328478 [14:03:08] ACKNOWLEDGEMENT - Disk space on relforge1001 is CRITICAL: DISK CRITICAL - free space: / 7699 MB (8% inode=97%): Gehel disk space is used by relforge-eqiad_feature.log, related to current investigation by Erik. 
Log is not growing at this time, waiting for Erik to have a look [14:03:25] ebernhardson: for when you are back ^ [14:07:08] (03CR) 10Muehlenhoff: [C: 032] Also follow stat1001 rename in debdeploy grains [puppet] - 10https://gerrit.wikimedia.org/r/328478 (owner: 10Muehlenhoff) [14:08:14] (03PS2) 10Ema: varnishxcache: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/328179 (https://phabricator.wikimedia.org/T151643) [14:08:59] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 2 minutes ago with 7 failures. Failed resources (up to 3 shown): Package[tzdata],Service[zotero],Exec[zotero-admin_ensure_members],Exec[sc-admins_ensure_members] [14:14:23] 06Operations, 06Commons, 10TimedMediaHandler, 10Wikimedia-Video: Creation of derivative video files is not working - https://phabricator.wikimedia.org/T153852#2893416 (10zhuyifei1999) 05Open>03Invalid Duplicate of {T153488} and {T153747} [14:18:29] RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [14:28:05] 06Operations, 10Wikimedia-Stream: Upstream prematurely closed connection - https://phabricator.wikimedia.org/T153772#2893423 (10ema) Note that this issue has been present for a very long time and is unrelated to the package upgrade performed yesterday and mentioned in T153773. The earliest occurrence of the m... [14:28:41] (03PS1) 10Elukey: Remove MongoDB dependency from statistics cruncher [puppet] - 10https://gerrit.wikimedia.org/r/328519 [14:28:49] PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:29:22] 06Operations, 10Analytics, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2893426 (10Jan_Dittrich) Is that testing framework also planned to work with central notice/banners, or is that a separate infrastructure? 
[14:29:51] !log installing imagemagick security updates [14:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:59] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [14:40:41] (03PS1) 10Giuseppe Lavagetto: changeprop: use codfw's restbase in codfw as well [puppet] - 10https://gerrit.wikimedia.org/r/328521 [14:40:47] <_joe_> mobrovac: ^^ [14:46:51] (03CR) 10Mobrovac: "Should we rather just declare this once in role/common/scb.yaml since the value will always the same for both DCs?" [puppet] - 10https://gerrit.wikimedia.org/r/328521 (owner: 10Giuseppe Lavagetto) [14:47:03] _joe_: ^^ :) [14:47:49] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[imagemagick],Package[httpry] [14:47:49] PROBLEM - puppet last run on mw2176 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[imagemagick] [14:53:49] RECOVERY - puppet last run on mw2127 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [14:54:06] !log installing ghostscript security updates on trusty hosts [14:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:49] RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [14:59:45] (03CR) 10Volans: "A nitpicking node inline ;)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328179 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [15:00:04] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/4960/" [puppet] - 10https://gerrit.wikimedia.org/r/328519 (owner: 10Elukey) [15:02:06] (03PS1) 10Muehlenhoff: Add snapshot::testbed to standard snapshot debdeploy group [puppet] - 10https://gerrit.wikimedia.org/r/328523 [15:04:15] !log removed mongodb* packages from stat1003 after https://gerrit.wikimedia.org/r/328519 [15:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:29] (03CR) 10Elukey: [V: 032 C: 032] Add another test to run Varnishkafka with Valgrind [software/varnish/varnishkafka/testing] - 10https://gerrit.wikimedia.org/r/328381 (https://phabricator.wikimedia.org/T147438) (owner: 10Elukey) [15:05:39] PROBLEM - DPKG on snapshot1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:06:59] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [15:07:59] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3987193 keys, up 51 days 6 hours - replication_delay is 41 [15:08:39] RECOVERY - DPKG on snapshot1001 is OK: All packages OK [15:14:42] (03PS1) 10Elukey: Increase Redis Replica Sync Nagios retry intreval to 2 minutes [puppet] - 
10https://gerrit.wikimedia.org/r/328525 [15:15:16] (03PS2) 10Elukey: Increase Redis Replica Sync Nagios retry intreval to 2 minutes [puppet] - 10https://gerrit.wikimedia.org/r/328525 [15:15:39] _joe_ --^ [15:15:50] RECOVERY - puppet last run on mw2176 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:16:19] (03PS3) 10Elukey: Increase Redis Replica Sync retry interval to 2 minutes [puppet] - 10https://gerrit.wikimedia.org/r/328525 [15:20:38] <_joe_> heh [15:20:48] <_joe_> yes, I'll take a look elukey [15:22:52] !log truncating /var/log/elasticsearch/relforge-eqiad_feature.log on relforge100[12] [15:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:29] RECOVERY - Disk space on relforge1001 is OK: DISK OK [15:29:33] (03CR) 10Muehlenhoff: [C: 032] Add snapshot::testbed to standard snapshot debdeploy group [puppet] - 10https://gerrit.wikimedia.org/r/328523 (owner: 10Muehlenhoff) [15:31:29] RECOVERY - Disk space on db1035 is OK: DISK OK [15:32:39] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:34:39] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:40:42] (03PS16) 10Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212 [15:40:44] (03PS15) 10Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/324213 [15:40:46] (03PS3) 10Alexandros Kosiaris: kubernetes::master: Introduce the kubernetes profile [puppet] - 10https://gerrit.wikimedia.org/r/328174 [15:40:48] (03PS3) 10Alexandros Kosiaris: Create and assign the kubernetes::master role [puppet] - 10https://gerrit.wikimedia.org/r/328175 [15:47:25] (03PS1) 10Muehlenhoff: Update to 4.4.34 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328529 [15:49:48] (03CR) 10Giuseppe Lavagetto: mediawiki::scaler: check orphaned HHVM threads (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/328194 (https://phabricator.wikimedia.org/T153488) (owner: 10Giuseppe Lavagetto) [15:49:57] (03PS1) 10Andrew Bogott: Wikitech: include openstack::clientlib on silver [puppet] - 10https://gerrit.wikimedia.org/r/328530 (https://phabricator.wikimedia.org/T153807) [15:50:09] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently [15:50:09] PROBLEM - check_payments_wiki on payments2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently [15:52:03] ^ fixing [15:53:41] (03CR) 10Andrew Bogott: [C: 032] Wikitech: include openstack::clientlib on silver [puppet] - 10https://gerrit.wikimedia.org/r/328530 (https://phabricator.wikimedia.org/T153807) (owner: 10Andrew Bogott) [15:54:58] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.34 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328529 (owner: 10Muehlenhoff) 
[15:55:09] PROBLEM - check_payments_wiki on payments2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently [15:55:09] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently [15:57:15] (03PS1) 10Muehlenhoff: Update to 4.4.35 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328531 [15:57:39] PROBLEM - puppet last run on maps1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:58:57] (03PS2) 10Giuseppe Lavagetto: changeprop: use codfw's restbase in codfw as well [puppet] - 10https://gerrit.wikimedia.org/r/328521 [15:59:27] (03PS2) 10Milimetric: Upgrade edit_history to mediawiki_history [puppet] - 10https://gerrit.wikimedia.org/r/325572 [15:59:57] <_joe_> mobrovac: merging ^^ [16:00:09] PROBLEM - check_payments_wiki on payments2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently [16:00:09] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently [16:00:52] kk _joe_, +1 from me [16:01:03] (03CR) 10Mobrovac: [C: 031] changeprop: use codfw's restbase in codfw as well [puppet] - 10https://gerrit.wikimedia.org/r/328521 (owner: 10Giuseppe Lavagetto) [16:01:17] (03PS3) 10Giuseppe Lavagetto: changeprop: use codfw's restbase in codfw as well [puppet] - 10https://gerrit.wikimedia.org/r/328521 [16:01:28] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] changeprop: use codfw's restbase in codfw as well [puppet] - 10https://gerrit.wikimedia.org/r/328521 (owner: 10Giuseppe Lavagetto) [16:04:19] RECOVERY - check_payments_wiki on payments2002 is OK: HTTP OK: Status line output matched HTTP/1.1 503 - 214 bytes in 0.013 second response time [16:04:29] RECOVERY - check_payments_wiki on payments2003 is OK: 
HTTP OK: Status line output matched HTTP/1.1 503 - 214 bytes in 0.013 second response time [16:05:09] PROBLEM - check_listener_gc on saiph is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 404 Not Found [16:05:09] PROBLEM - check_listener_ipn on saiph is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 404 Not Found [16:05:59] (03PS3) 10Gehel: New upstream version: 1.11.0 [debs/logstash-gelf] - 10https://gerrit.wikimedia.org/r/320992 (https://phabricator.wikimedia.org/T150408) [16:06:19] RECOVERY - check_listener_gc on saiph is OK: HTTP OK: Status line output matched HTTP/1.1 503 - 214 bytes in 0.014 second response time [16:06:19] RECOVERY - check_listener_ipn on saiph is OK: HTTP OK: Status line output matched HTTP/1.1 503 - 214 bytes in 0.012 second response time [16:06:30] 06Operations, 06Labs, 13Patch-For-Review: openstackclient/keystoneclient on silver broken - https://phabricator.wikimedia.org/T153807#2893685 (10Andrew) 05Open>03Resolved a:03Andrew Attached patch plus a bit of fussing with packages (python-oslo-serialization vs python-oslo.serialization) and this is f... [16:11:57] (03PS3) 10Milimetric: Upgrade edit_history to mediawiki_history [puppet] - 10https://gerrit.wikimedia.org/r/325572 [16:12:29] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:13:39] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:16:59] (03CR) 10Joal: [C: 031] "LGTM !" 
[puppet] - 10https://gerrit.wikimedia.org/r/325572 (owner: 10Milimetric) [16:17:58] (03PS4) 10Elukey: Upgrade edit_history to mediawiki_history [puppet] - 10https://gerrit.wikimedia.org/r/325572 (owner: 10Milimetric) [16:18:26] (03PS1) 10Yuvipanda: labsdb: Add delete user functionality to maintain-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/328533 [16:22:58] (03PS2) 10Giuseppe Lavagetto: mediawiki::scaler: check orphaned HHVM threads [puppet] - 10https://gerrit.wikimedia.org/r/328194 (https://phabricator.wikimedia.org/T153488) [16:24:04] 06Operations, 06Analytics-Kanban, 06Reading-Web-Backlog, 10Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2893736 (10mforns) [16:25:39] RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:28:18] (03CR) 10Giuseppe Lavagetto: [C: 031] "Although those alarms are a symptom of a real problem, I agree." [puppet] - 10https://gerrit.wikimedia.org/r/328525 (owner: 10Elukey) [16:29:04] (03PS4) 10Elukey: Increase Redis Replica Sync retry interval to 2 minutes [puppet] - 10https://gerrit.wikimedia.org/r/328525 [16:31:15] (03CR) 10Elukey: [C: 032] Increase Redis Replica Sync retry interval to 2 minutes [puppet] - 10https://gerrit.wikimedia.org/r/328525 (owner: 10Elukey) [16:31:51] (03PS5) 10Milimetric: Upgrade edit-history to mediawiki-history-beta [puppet] - 10https://gerrit.wikimedia.org/r/325572 [16:33:34] (03PS6) 10Elukey: Upgrade edit-history to mediawiki-history-beta [puppet] - 10https://gerrit.wikimedia.org/r/325572 (owner: 10Milimetric) [16:35:02] (03CR) 10Elukey: [C: 032] Upgrade edit-history to mediawiki-history-beta [puppet] - 10https://gerrit.wikimedia.org/r/325572 (owner: 10Milimetric) [16:38:13] elukey, nuria: i'm all set on stat1002; thank you! 
[16:38:35] \o/ [16:39:19] PROBLEM - pivot on thorium is CRITICAL: connect to address 10.64.53.26 and port 9090: Connection refused [16:40:04] (03PS3) 10Giuseppe Lavagetto: mediawiki::scaler: check orphaned HHVM threads [puppet] - 10https://gerrit.wikimedia.org/r/328194 (https://phabricator.wikimedia.org/T153488) [16:40:10] pivot is me [16:40:18] <_joe_> elukey: merging this ^^ FYI [16:40:19] RECOVERY - pivot on thorium is OK: TCP OK - 0.000 second response time on 10.64.53.26 port 9090 [16:40:29] _joe_ ack! [16:40:47] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] mediawiki::scaler: check orphaned HHVM threads [puppet] - 10https://gerrit.wikimedia.org/r/328194 (https://phabricator.wikimedia.org/T153488) (owner: 10Giuseppe Lavagetto) [16:41:42] (03PS1) 10Milimetric: Fix typo in pivot config yaml [puppet] - 10https://gerrit.wikimedia.org/r/328537 [16:42:29] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:42:49] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:50:42] (03PS2) 10Milimetric: Fix typo in pivot config yaml [puppet] - 10https://gerrit.wikimedia.org/r/328537 [16:51:28] 06Operations, 10Ops-Access-Requests, 10Reading-Web-Trending-Service, 06Services (done), 15User-mobrovac: Allow @Jdlrobson and @bearND to deploy and manage the trending edits service - https://phabricator.wikimedia.org/T153458#2893783 (10bearND) Thank you. I'll try it out early next year when we can do de... 
[16:54:19] (03PS3) 10Elukey: Fix typo in pivot config yaml [puppet] - 10https://gerrit.wikimedia.org/r/328537 (owner: 10Milimetric) [16:54:34] 06Operations, 06Performance-Team, 06Reading-Web-Backlog, 10Traffic, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#2893795 (10dr0ptp4kt) [16:55:19] (03CR) 10Elukey: [C: 032] Fix typo in pivot config yaml [puppet] - 10https://gerrit.wikimedia.org/r/328537 (owner: 10Milimetric) [16:56:14] biggest mediawikis we know of, by number of "good" pages: 1) wikidata 2) http://lietuvai.lt/wiki/ 3) https://www.wikiocity.com 4) en.wp [17:00:08] 06Operations, 06Performance-Team, 06Reading-Infrastructure-Team, 06Reading-Web-Backlog, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#2893805 (10dr0ptp4kt) @Tgr @Anomie: during our Q3 FY 2016-2017 interlock with Performance and Technical Operations @Gille... [17:02:55] 06Operations, 06Performance-Team, 06Reading-Web-Backlog, 10Traffic, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#2893816 (10elukey) >>! In T70861#2893792, @dr0ptp4kt wrote: > You should also ensure @BBlack and @elukey from #traffic are in t... [17:04:38] 06Operations, 06Performance-Team, 06Reading-Web-Backlog, 10Traffic, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#2893819 (10dr0ptp4kt) Sorry @elukey, hi @ema. [17:05:35] elukey: sorry about that! i fixed the other ticket to say ema as well :) [17:07:17] (03PS1) 10Joal: Update pivot conf for mediawiki history beta [puppet] - 10https://gerrit.wikimedia.org/r/328540 [17:07:23] elukey: --^ if you have time [17:08:11] dr0ptp4kt: no worries! 
I was chatting with Ema today about similar things that happened recently, our usernames are close enough to be swapped sometimes :D [17:09:47] :) [17:11:05] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [17:11:32] 06Operations, 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure, 07Zuul: Zuul has started failing on some repo's in gerrit.wikimedia.org - https://phabricator.wikimedia.org/T153877#2893837 (10hashar) @Andrew pbr 1.10.0 is broken it fails to recognize a version such as the Zuul one `2... [17:13:42] 06Operations, 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure, 07Zuul: Zuul has started failing on some repo's in gerrit.wikimedia.org - https://phabricator.wikimedia.org/T153877#2893846 (10Paladox) @hashar but it works on another test instance for me. gerrit-test. root@gerrit-te... [17:15:31] joal: would you mind also removing the old #description comment? I forgot to ask for it in the previous code review.. [17:15:39] elukey: doing ! [17:16:07] (that would be the super long comment) [17:16:24] joal: mmm not sure if we need it or not [17:16:26] maybe I am mistaken [17:16:31] I'll let you decide! [17:16:51] elukey: I don't mind having it or deleting it :) [17:17:06] all right let's leave it for now [17:17:15] ok, anything else? 
[17:17:17] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#2893853 (10jcrespo) [17:17:21] (03CR) 10Elukey: [C: 032] Update pivot conf for mediawiki history beta [puppet] - 10https://gerrit.wikimedia.org/r/328540 (owner: 10Joal) [17:17:26] Yay, thanks elukey :) [17:30:59] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#2893895 (10jcrespo) [17:32:27] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.35 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328531 (owner: 10Muehlenhoff) [17:41:24] (03PS1) 10Joal: Correct pivot conf for mediawiki history beta [puppet] - 10https://gerrit.wikimedia.org/r/328545 [17:41:31] elukey: as promised --^ [17:41:32] :) [17:42:45] (03PS1) 10Muehlenhoff: Update to 4.4.36 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328546 [17:47:38] (03PS1) 10Dzahn: add europium.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/328547 [17:48:48] (03PS2) 10Dzahn: add europium.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/328547 [17:49:20] (03CR) 10Elukey: [C: 032] Correct pivot conf for mediawiki history beta [puppet] - 10https://gerrit.wikimedia.org/r/328545 (owner: 10Joal) [17:49:35] (03CR) 10Dzahn: [C: 032] add europium.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/328547 (owner: 10Dzahn) [17:52:01] joal: done! [17:52:05] (03PS1) 10Dzahn: Revert "remove dhcp entry for old hostname europium" [puppet] - 10https://gerrit.wikimedia.org/r/328548 [17:52:28] (03CR) 10jerkins-bot: [V: 04-1] Revert "remove dhcp entry for old hostname europium" [puppet] - 10https://gerrit.wikimedia.org/r/328548 (owner: 10Dzahn) [17:53:07] (03CR) 10Dzahn: "i found this orphaned host and it still had the old OS on it. 
i reopened T82239 and just re-adding this because i want to test installserv" [puppet] - 10https://gerrit.wikimedia.org/r/328548 (owner: 10Dzahn) [17:53:43] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:54:15] well, turns out maybe hitting revert on a change from 2013 doesn't always work that well [17:58:30] <_joe_> lol [18:02:57] (03PS1) 10Dzahn: dhcp: re-add europium but in eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/328550 [18:05:13] PROBLEM - check_payments_wiki on payments2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently [18:07:37] grrr. payments1001 ^^ I see it [18:08:35] (03PS2) 10Dzahn: dhcp: re-add europium but in eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/328550 [18:09:31] (03CR) 10Dzahn: [C: 032] dhcp: re-add europium but in eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/328550 (owner: 10Dzahn) [18:10:10] (03Abandoned) 10Dzahn: Revert "remove dhcp entry for old hostname europium" [puppet] - 10https://gerrit.wikimedia.org/r/328548 (owner: 10Dzahn) [18:13:08] !log carbon - temp stopping dhcp server [18:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:12] (03PS4) 10Alexandros Kosiaris: kubernetes::master: Introduce the kubernetes profile [puppet] - 10https://gerrit.wikimedia.org/r/328174 [18:14:14] (03PS4) 10Alexandros Kosiaris: Create and assign the kubernetes::master role [puppet] - 10https://gerrit.wikimedia.org/r/328175 [18:14:16] (03PS17) 10Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212 [18:14:18] (03PS16) 10Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/324213 [18:14:20] (03PS1) 10Alexandros Kosiaris: k8s::apiserver: Allow specifying the SSL file paths [puppet] - 
10https://gerrit.wikimedia.org/r/328553 [18:16:33] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:22:43] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [18:27:01] (03PS4) 10Marostegui: [WIP] Reporting tests with the private data script [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) [18:29:37] (03CR) 10Marostegui: [WIP] Reporting tests with the private data script (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) (owner: 10Marostegui) [18:31:37] 06Operations, 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure, 07Zuul: Zuul has started failing on some repo's in gerrit.wikimedia.org - https://phabricator.wikimedia.org/T153877#2894082 (10hashar) We had version 0.8.2 The CI instance have an unattended-upgrade for repositories *-... 
[18:33:23] RECOVERY - check_payments_wiki on payments2001 is OK: HTTP OK: HTTP/1.1 200 OK - 258 bytes in 0.013 second response time [18:34:24] (03PS1) 10Andrew Bogott: Try to pin python-pbr [puppet] - 10https://gerrit.wikimedia.org/r/328555 [18:44:33] RECOVERY - puppet last run on elastic1018 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [18:46:45] (03PS3) 10Dzahn: install/dhcp: switch all "next-server" from carbon to install1001 [puppet] - 10https://gerrit.wikimedia.org/r/328439 (https://phabricator.wikimedia.org/T132757) [18:47:11] (03PS1) 10Filippo Giunchedi: prometheus: fix syntax for mysql recording rules [puppet] - 10https://gerrit.wikimedia.org/r/328557 [18:47:20] (03CR) 10Dzahn: "the only thing i could imagine is a problem here is that labs subnets are somehow special / need other config" [puppet] - 10https://gerrit.wikimedia.org/r/328439 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [18:50:37] (03CR) 10Alexandros Kosiaris: [C: 031] install/dhcp: switch all "next-server" from carbon to install1001 [puppet] - 10https://gerrit.wikimedia.org/r/328439 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [18:59:26] 06Operations, 10Annual-Report: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#2894164 (10ZMcCune) @Heather: Any reason not to use https://annual.wikimedia.org/2016/? @Dzahn: Thank you! We will let you know. Hope to have the static pages ready in early January. [19:03:55] (03PS2) 10Andrew Bogott: Pin python-pbr to an old version for Zuul [puppet] - 10https://gerrit.wikimedia.org/r/328555 (https://phabricator.wikimedia.org/T153877) [19:04:38] Hi operations! Quick question: what kind of bad things could happen to a network that we could measure? Besides slowness? Maybe dropped connections? [19:04:52] Any way to measure the rate of those by geographic region? 
[19:05:13] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [19:05:59] (03PS3) 10Andrew Bogott: Pin python-pbr to an old version for Zuul [puppet] - 10https://gerrit.wikimedia.org/r/328555 (https://phabricator.wikimedia.org/T153877) [19:06:08] (03CR) 10Alexandros Kosiaris: [C: 031] Pin python-pbr to an old version for Zuul [puppet] - 10https://gerrit.wikimedia.org/r/328555 (https://phabricator.wikimedia.org/T153877) (owner: 10Andrew Bogott) [19:06:13] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3979921 keys, up 51 days 10 hours - replication_delay is 43 [19:06:59] bblack: ema: ^ ? (apologies for bugging u so often 8p) [19:07:46] (03PS5) 10Alexandros Kosiaris: kubernetes::master: Introduce the kubernetes profile [puppet] - 10https://gerrit.wikimedia.org/r/328174 [19:07:48] (03PS5) 10Alexandros Kosiaris: Create and assign the kubernetes::master role [puppet] - 10https://gerrit.wikimedia.org/r/328175 [19:07:50] (03PS18) 10Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212 [19:07:52] (03PS17) 10Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/324213 [19:08:01] AndyRussG: fwiw, i know we have probes that are part of https://atlas.ripe.net/ [19:08:30] so that kind of exists already [19:09:06] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2739503 (10Deskana) @gehel Has everything gone as planned? I assume silence on this ticket is good news. :-) [19:11:22] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2894219 (10Gehel) Silence is a good thing! 
But traffic has left codfw again, and not long after the firmware upgrade by @Papaul. So it works,... [19:12:20] (03CR) 10Dzahn: [C: 032] install/dhcp: switch all "next-server" from carbon to install1001 [puppet] - 10https://gerrit.wikimedia.org/r/328439 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [19:12:26] mutante: cool, thanks! [19:12:26] (03CR) 10Andrew Bogott: [C: 032] Pin python-pbr to an old version for Zuul [puppet] - 10https://gerrit.wikimedia.org/r/328555 (https://phabricator.wikimedia.org/T153877) (owner: 10Andrew Bogott) [19:12:35] (03PS4) 10Andrew Bogott: Pin python-pbr to an old version for Zuul [puppet] - 10https://gerrit.wikimedia.org/r/328555 (https://phabricator.wikimedia.org/T153877) [19:13:53] !log carbon - re-enabled puppet and DHCP [19:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:22] !log public1-b-eqiad and public1-c-eqiad are configured to use install1001 as DHCP, all others still use carbon as DHCP | all subnets now use install1001 as TFTP [19:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:08] 06Operations, 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 07Zuul: Zuul has started failing on some repo's in gerrit.wikimedia.org - https://phabricator.wikimedia.org/T153877#2894226 (10hashar) Cherry picked https://gerrit.wikimedia.org/r/328555 on the CI... [19:16:16] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2894227 (10Deskana) >>! In T149006#2894219, @Gehel wrote: > Silence is a good thing! But traffic has left codfw again, and not long after the f... 
[19:16:36] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2739503 (10Deskana) 05Open>03Resolved [19:17:04] 06Operations, 10Annual-Report: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#2894232 (10Heather) That looks good. Thanks, everyone! [19:17:54] is relieved [19:18:01] that we are ok with annual.wm/2016 [19:25:36] 06Operations, 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 07Zuul: Zuul has started failing on some repo's in gerrit.wikimedia.org - https://phabricator.wikimedia.org/T153877#2894252 (10hashar) 05Open>03Resolved a:03Andrew Ran puppet on contint1001 /... [19:27:01] (03PS2) 10Filippo Giunchedi: prometheus: fix syntax for mysql recording rules [puppet] - 10https://gerrit.wikimedia.org/r/328557 [19:28:32] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: fix syntax for mysql recording rules [puppet] - 10https://gerrit.wikimedia.org/r/328557 (owner: 10Filippo Giunchedi) [19:34:43] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [19:39:53] (03CR) 10Filippo Giunchedi: "> I may need help to apply this to mysql-aggregated." 
[puppet] - 10https://gerrit.wikimedia.org/r/328425 (owner: 10Filippo Giunchedi) [19:52:05] (03PS2) 10Yuvipanda: labsdb: Add delete user functionality to maintain-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/328533 [19:55:21] (03PS3) 10Yuvipanda: labsdb: Add delete user functionality to maintain-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/328533 [19:55:28] (03CR) 10Yuvipanda: [V: 032 C: 032] labsdb: Add delete user functionality to maintain-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/328533 (owner: 10Yuvipanda) [20:01:43] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:02:43] (03PS1) 10Yuvipanda: labsdb: Run maintain-dbusers only on active NFS host [puppet] - 10https://gerrit.wikimedia.org/r/328564 [20:02:43] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [20:04:09] (03Abandoned) 10Yuvipanda: base: Use the standard location for puppet ca [puppet] - 10https://gerrit.wikimedia.org/r/257275 (owner: 10Yuvipanda) [20:04:13] (03CR) 10jerkins-bot: [V: 04-1] labsdb: Run maintain-dbusers only on active NFS host [puppet] - 10https://gerrit.wikimedia.org/r/328564 (owner: 10Yuvipanda) [20:04:26] (03Abandoned) 10Yuvipanda: labs: Limit who can login via the ssh key lookup tool too [puppet] - 10https://gerrit.wikimedia.org/r/259455 (owner: 10Yuvipanda) [20:04:37] (03Abandoned) 10Yuvipanda: base: Do not do add host nagios monitoring in labs [puppet] - 10https://gerrit.wikimedia.org/r/262359 (https://phabricator.wikimedia.org/T122757) (owner: 10Yuvipanda) [20:04:43] (03Abandoned) 10Yuvipanda: toollabs: Point shadow to correct master host [puppet] - 10https://gerrit.wikimedia.org/r/265199 (owner: 10Yuvipanda) [20:05:26] (03Abandoned) 10Yuvipanda: cache: Add labtestspice.wikimedia.org behind misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/301178 (owner: 10Yuvipanda) [20:05:31] (03Abandoned) 
10Yuvipanda: tools: Do not have static class inherit from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/278431 (https://phabricator.wikimedia.org/T128411) (owner: 10Yuvipanda) [20:06:49] (03Abandoned) 10Yuvipanda: ssh: Disable 2fa for labs [puppet] - 10https://gerrit.wikimedia.org/r/318981 (https://phabricator.wikimedia.org/T147998) (owner: 10Yuvipanda) [20:07:28] (03PS5) 10Yuvipanda: labs: Clean out projects that don't exist anymore from mounts [puppet] - 10https://gerrit.wikimedia.org/r/327522 [20:08:32] (03CR) 10Yuvipanda: [V: 032 C: 032] labs: Clean out projects that don't exist anymore from mounts [puppet] - 10https://gerrit.wikimedia.org/r/327522 (owner: 10Yuvipanda) [20:10:05] (03Abandoned) 10Yuvipanda: [WIP] ldap: Cleanup module [puppet] - 10https://gerrit.wikimedia.org/r/287663 (owner: 10Yuvipanda) [20:10:45] (03Abandoned) 10Yuvipanda: tools: Use phabricator as source for kubernetes building [puppet] - 10https://gerrit.wikimedia.org/r/303727 (https://phabricator.wikimedia.org/T142448) (owner: 10Yuvipanda) [20:11:07] (03PS3) 10Yuvipanda: puppetmaster: Cleanup unused vars / crons in labs puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/312317 [20:11:37] (03Abandoned) 10Yuvipanda: puppet: Disable enc by default on trusty for now [puppet] - 10https://gerrit.wikimedia.org/r/311761 (owner: 10Yuvipanda) [20:18:03] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [20:20:27] (03CR) 10Yuvipanda: [C: 032] Tools: Quote arguments in clush [puppet] - 10https://gerrit.wikimedia.org/r/326380 (owner: 10Tim Landscheidt) [20:21:09] ^ was me [20:21:10] sorry [20:22:03] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
[20:25:30] (03PS15) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) [20:29:43] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [20:30:29] (03CR) 10Yuvipanda: "I personally like to stick to bash, but I don't particularly care either way..." [puppet] - 10https://gerrit.wikimedia.org/r/326379 (owner: 10Tim Landscheidt) [20:30:35] (03PS2) 10Yuvipanda: Tools: Remove bashisms from clush [puppet] - 10https://gerrit.wikimedia.org/r/326379 (owner: 10Tim Landscheidt) [20:30:41] (03CR) 10Yuvipanda: [V: 032 C: 032] Tools: Remove bashisms from clush [puppet] - 10https://gerrit.wikimedia.org/r/326379 (owner: 10Tim Landscheidt) [20:30:51] (03PS2) 10Yuvipanda: Tools: Quote arguments in clush [puppet] - 10https://gerrit.wikimedia.org/r/326380 (owner: 10Tim Landscheidt) [20:30:55] (03CR) 10Yuvipanda: [V: 032 C: 032] Tools: Quote arguments in clush [puppet] - 10https://gerrit.wikimedia.org/r/326380 (owner: 10Tim Landscheidt) [20:41:03] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [20:41:53] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:43:03] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [20:45:26] 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2894515 (10Dzahn) - added test host europium (https://gerrit.wikimedia.org/r/328547, https://gerrit.wikimedia.org/r/328550, https://gerrit.wikimedia.org/r/328548) - @akosiari... 
[20:47:45] (03PS2) 10Yuvipanda: labsdb: Run maintain-dbusers only on active NFS host [puppet] - 10https://gerrit.wikimedia.org/r/328564 [20:48:06] (03PS3) 10Yuvipanda: labsdb: Run maintain-dbusers only on active NFS host [puppet] - 10https://gerrit.wikimedia.org/r/328564 [20:48:48] (03CR) 10jerkins-bot: [V: 04-1] labsdb: Run maintain-dbusers only on active NFS host [puppet] - 10https://gerrit.wikimedia.org/r/328564 (owner: 10Yuvipanda) [20:48:59] 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2894537 (10Dzahn) other things we need: - move ganglia aggregator - change ACLs / ferm rules for webproxy -- (it's still webproxy 1H IN CNAME carbon.wikimedia.org. in... [20:52:15] (03PS4) 10Yuvipanda: labsdb: Run maintain-dbusers only on active NFS host [puppet] - 10https://gerrit.wikimedia.org/r/328564 [20:53:23] (03CR) 10jerkins-bot: [V: 04-1] labsdb: Run maintain-dbusers only on active NFS host [puppet] - 10https://gerrit.wikimedia.org/r/328564 (owner: 10Yuvipanda) [20:54:12] (03PS5) 10Yuvipanda: labsdb: Run maintain-dbusers only on active NFS host [puppet] - 10https://gerrit.wikimedia.org/r/328564 [20:57:40] (03PS1) 10Urbanecm: Set sortPrepend for gdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328570 (https://phabricator.wikimedia.org/T153900) [20:58:15] (03CR) 10jerkins-bot: [V: 04-1] Set sortPrepend for gdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328570 (https://phabricator.wikimedia.org/T153900) (owner: 10Urbanecm) [20:58:24] (03PS6) 10Yuvipanda: labsdb: Run maintain-dbusers only on active NFS host [puppet] - 10https://gerrit.wikimedia.org/r/328564 [20:58:41] (03CR) 10Yuvipanda: [V: 032 C: 032] labsdb: Run maintain-dbusers only on active NFS host [puppet] - 10https://gerrit.wikimedia.org/r/328564 (owner: 10Yuvipanda) [20:59:23] (03PS2) 10Urbanecm: Set sortPrepend for gdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328570 
(https://phabricator.wikimedia.org/T153900) [21:01:30] (03Draft1) 10Paladox: Gerrit: Convert from utf8 to utf8mb4 [puppet] - 10https://gerrit.wikimedia.org/r/328571 (https://phabricator.wikimedia.org/T153899) [21:03:18] (03PS2) 10Paladox: Gerrit: Convert from utf8 to utf8mb4 [puppet] - 10https://gerrit.wikimedia.org/r/328571 (https://phabricator.wikimedia.org/T153899) [21:05:47] (03PS1) 10Yuvipanda: labsdb: Followup to I3156a406f37dd5344273faf5c770c32eddee0e25 [puppet] - 10https://gerrit.wikimedia.org/r/328572 [21:06:20] (03CR) 10jerkins-bot: [V: 04-1] labsdb: Followup to I3156a406f37dd5344273faf5c770c32eddee0e25 [puppet] - 10https://gerrit.wikimedia.org/r/328572 (owner: 10Yuvipanda) [21:07:13] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:08:03] (03PS2) 10Yuvipanda: labsdb: Followup to I3156a406f37dd5344273faf5c770c32eddee0e25 [puppet] - 10https://gerrit.wikimedia.org/r/328572 [21:08:35] (03CR) 10jerkins-bot: [V: 04-1] labsdb: Followup to I3156a406f37dd5344273faf5c770c32eddee0e25 [puppet] - 10https://gerrit.wikimedia.org/r/328572 (owner: 10Yuvipanda) [21:09:00] (03PS3) 10Yuvipanda: labsdb: Followup to I3156a406f37dd5344273faf5c770c32eddee0e25 [puppet] - 10https://gerrit.wikimedia.org/r/328572 [21:10:03] RECOVERY - puppet last run on ms-be1016 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [21:11:34] 06Operations, 10Wikimedia-Site-requests, 07Chinese-Sites, 07Community-consensus-needed: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991#2894683 (10Dzahn) [21:12:07] (03CR) 10Yuvipanda: [C: 032] labsdb: Followup to I3156a406f37dd5344273faf5c770c32eddee0e25 [puppet] - 10https://gerrit.wikimedia.org/r/328572 (owner: 10Yuvipanda) [21:13:21] 06Operations, 10Wikimedia-Site-requests, 07Chinese-Sites, 07Community-consensus-needed: Enable "upload by url" feature at zhwiki - 
https://phabricator.wikimedia.org/T142991#2553271 (10Dzahn) re-adding Operations because of "There is still the question to get ops green light for upload.wikimedia.org copy by... [21:13:59] 06Operations, 10Wikimedia-Site-requests, 07Chinese-Sites, 07Community-consensus-needed: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991#2894692 (10Dzahn) p:05Triage>03Normal [21:14:06] 06Operations, 10Wikimedia-Site-requests, 07Chinese-Sites, 07Community-consensus-needed: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991#2553271 (10Dzahn) 05Open>03stalled [21:14:13] RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [21:24:55] (03CR) 10Alex Monk: "It may be difficult to find someone who has the necessary privileges, knowledge and interest to do this instead" [puppet] - 10https://gerrit.wikimedia.org/r/287663 (owner: 10Yuvipanda) [21:25:34] (03CR) 10Yuvipanda: "Yup :(" [puppet] - 10https://gerrit.wikimedia.org/r/287663 (owner: 10Yuvipanda) [21:39:23] (03CR) 10Andrew Bogott: "I'm sure that these can go, but best to merge this after I'm back from traveling." [puppet] - 10https://gerrit.wikimedia.org/r/318451 (owner: 10Dzahn) [22:16:44] (03PS1) 10Smalyshev: Add configuration for query endpoint URL [puppet] - 10https://gerrit.wikimedia.org/r/328582 (https://phabricator.wikimedia.org/T153897) [22:17:28] 06Operations, 05Prometheus-metrics-monitoring: Improvements to Ganglia-equivalent Prometheus dashboards - https://phabricator.wikimedia.org/T152791#2894804 (10fgiunchedi) @ArielGlenn indeed the stacked graphs are meant for cluster-wide overviews, would the breakdown per-host be enough in this case for what you... 
[22:18:42] (03CR) 10Dzahn: [C: 031] Introduce dbmonitor1001, dbmonitor2001 [puppet] - 10https://gerrit.wikimedia.org/r/328509 (https://phabricator.wikimedia.org/T149557) (owner: 10Alexandros Kosiaris) [22:39:03] 06Operations, 05Prometheus-metrics-monitoring: Improvements to Ganglia-equivalent Prometheus dashboards - https://phabricator.wikimedia.org/T152791#2894847 (10ArielGlenn) >>! In T152791#2894804, @fgiunchedi wrote: > @ArielGlenn indeed the stacked graphs are meant for cluster-wide overviews, would the breakdown... [22:43:43] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:07:53] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:11:43] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [23:19:41] (03PS1) 10Matanya: remove absented file long gone [puppet] - 10https://gerrit.wikimedia.org/r/328596 [23:31:10] 06Operations: Setup europium as locke replacement - https://phabricator.wikimedia.org/T82239#2894958 (10Dzahn) [23:31:17] !log europium - re-installing with jessie (T82239) [23:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:23] T82239: Setup europium as locke replacement - https://phabricator.wikimedia.org/T82239 [23:34:43] 06Operations: reclaim europium - https://phabricator.wikimedia.org/T153918#2894991 (10Dzahn) [23:35:53] RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [23:36:03] 06Operations: reclaim europium - https://phabricator.wikimedia.org/T153918#2894991 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/328547/ https://gerrit.wikimedia.org/r/#/c/328550/ https://gerrit.wikimedia.org/r/#/c/328548/ ---- 15:32 < mutante> !log europium - re-installing with jessie (T82239) ---- https://r... 
[23:36:31] 06Operations: Setup europium as locke replacement - https://phabricator.wikimedia.org/T82239#898279 (10Dzahn) 05Open>03Resolved created subtask for reclaim [23:37:03] 06Operations, 10ops-eqiad: reclaim europium - https://phabricator.wikimedia.org/T153918#2895016 (10Dzahn) [23:37:54] 06Operations, 10ops-eqiad: reclaim europium - https://phabricator.wikimedia.org/T153918#2895020 (10Dzahn) europium.eqiad.wmnet - eqiad row C - C7 @ 38 [23:39:13] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:45:43] Reedy: https://phabricator.wikimedia.org/T153920 can you please triage ? [23:45:56] matanya: It's a dupe [23:46:04] oh, sorry [23:46:21] :) [23:48:04] !log europium - jessie reinstall done - powered down until until reclaim (T153918) [23:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:07] T153918: reclaim europium - https://phabricator.wikimedia.org/T153918 [23:49:18] 06Operations, 10ops-eqiad: reclaim europium - https://phabricator.wikimedia.org/T153918#2895108 (10Dzahn) a:05RobH>03None [23:50:43] PROBLEM - puppet last run on db1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:51:44] (03PS8) 10Andrew Bogott: Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/328400 (https://phabricator.wikimedia.org/T150774) [23:55:23] (03PS1) 10Dzahn: openstack: switch tftp server from carbon to install1001 [puppet] - 10https://gerrit.wikimedia.org/r/328597 (https://phabricator.wikimedia.org/T123733)