[00:00:03] <wikibugs>	 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2891214 (10fgiunchedi) I'm trying to reproduce the problem on mw1189, I couldn't find the exact request that parsoid is making to `api...
[00:10:27] <mutante>	 godog: a second puppet run then installs a bunch of packages, such as fonts, and the error is gone, it looks like
[00:10:51] <mutante>	 so not persisting, but still not done, will let you know if otherwise
[00:11:32] <godog>	 ah I think I know what it is, the exporter isn't creating the 'prometheus' user would be my guess
[00:12:01] <godog>	 but then e.g. prometheus-node-exporter does and then it works
[00:23:30] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "A couple of missing things and other comments inline." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/328194 (https://phabricator.wikimedia.org/T153488) (owner: 10Giuseppe Lavagetto)
[00:25:09] <wikibugs>	 (03CR) 10Alex Monk: "You don't need this on all wikis with PageAssessments installed?" [puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) (owner: 10Kaldari)
[00:25:38] <icinga-wm>	 PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[00:27:28] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.37.21:9200/_cluster/health error while fetching: (Connection aborted., error(111, Connection refused))
[00:27:48] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on relforge1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 180 threshold =0.1% breach: status: red, number_of_nodes: 1, unassigned_shards: 180, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 180, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 50
[00:28:18] <icinga-wm>	 RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.242 second response time
[00:30:28] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: yellow, number_of_nodes: 2, unassigned_shards: 20, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 275, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 93.9058171745, active_shards: 33
[00:30:48] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 275, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 361, initial
[00:31:46] <wikibugs>	 (03CR) 10Kaldari: "@Alex Monk: Right now, it's only being actively used on English Wikipedia. The other 2 wikis it was activated on were mostly for testing p" [puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) (owner: 10Kaldari)
[00:32:31] <wikibugs>	 (03CR) 10Alex Monk: "that's a common one I think. The other would be a dblist" [puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) (owner: 10Kaldari)
[00:39:40] <wikibugs>	 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2891675 (10fgiunchedi) after chatting with @Pchelolo  I've diffed the conftool pools for `api.svc.eqiad.wmnet` and https has servers d...
[00:45:04] <wikibugs>	 06Operations, 06Labs, 10Labs-Infrastructure, 10Monitoring: Have a paging check for Nova API accessible - https://phabricator.wikimedia.org/T133656#2891683 (10AlexMonk-WMF) Basic check added in T42022. My question above should be pretty easy though.
[00:49:28] <icinga-wm>	 PROBLEM - puppet last run on mw1279 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:52:28] <icinga-wm>	 PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:54:17] <wikibugs>	 (03Abandoned) 10Paladox: Contint: Downgrade Chromium to 53.0.2785 [puppet] - 10https://gerrit.wikimedia.org/r/328217 (https://phabricator.wikimedia.org/T153597) (owner: 10Paladox)
[01:03:07] <wikibugs>	 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2891700 (10fgiunchedi) The Parsoid dashboard shows non-200 codes, I was looking for total 200s but couldn't find it yet https://grafan...
[01:04:56] <wikibugs>	 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2891701 (10Pchelolo) You can find this info at https://grafana.wikimedia.org/dashboard/db/restbase?panelId=13&fullscreen  The rate of...
[01:09:46] <mutante>	 !log mw1168 - remove old salt key, accept new salt key, start minion
[01:09:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:14] <logmsgbot>	 !log dzahn@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1168.eqiad.wmnet
[01:11:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:28] <icinga-wm>	 RECOVERY - puppet last run on mw1279 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[01:21:28] <icinga-wm>	 RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[01:23:06] <wikibugs>	 (03PS1) 10Dzahn: dhcp: switch private-c-eqiad to install1001 as tftp [puppet] - 10https://gerrit.wikimedia.org/r/328448 (https://phabricator.wikimedia.org/T132757)
[01:23:36] <wikibugs>	 (03PS3) 10Filippo Giunchedi: prometheus: add aggregation rules for MySQL [puppet] - 10https://gerrit.wikimedia.org/r/328425
[01:24:22] <wikibugs>	 (03PS2) 10Dzahn: dhcp: switch private-c-eqiad to install1001 as tftp [puppet] - 10https://gerrit.wikimedia.org/r/328448 (https://phabricator.wikimedia.org/T132757)
[01:24:43] <wikibugs>	 (03CR) 10Dzahn: [C: 032] dhcp: switch private-c-eqiad to install1001 as tftp [puppet] - 10https://gerrit.wikimedia.org/r/328448 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn)
[01:25:43] <wikibugs>	 (03CR) 10Dzahn: [V: 032 C: 032] dhcp: switch private-c-eqiad to install1001 as tftp [puppet] - 10https://gerrit.wikimedia.org/r/328448 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn)
[01:29:42] <wikibugs>	 06Operations, 06Labs: openstackclient/keystoneclient on silver broken - https://phabricator.wikimedia.org/T153807#2891728 (10AlexMonk-WMF)
[01:39:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add aggregation rules for MySQL [puppet] - 10https://gerrit.wikimedia.org/r/328425 (owner: 10Filippo Giunchedi)
[01:39:50] <wikibugs>	 (03PS4) 10Filippo Giunchedi: prometheus: add aggregation rules for MySQL [puppet] - 10https://gerrit.wikimedia.org/r/328425
[01:43:59] <wikibugs>	 (03PS3) 10VolkerE: Make notification logos high-density [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319968 (https://phabricator.wikimedia.org/T147219) (owner: 10Catrope)
[01:45:11] <wikibugs>	 (03CR) 10Dzahn: "so, about the existing instances that use this role. If we'd just reconfigure them and add the new role to the existing setup i guess the " [puppet] - 10https://gerrit.wikimedia.org/r/327690 (https://phabricator.wikimedia.org/T139475) (owner: 10Dzahn)
[01:47:07] <mutante>	 !log mw1169 - schedule 2 hours downtime - boot for reinstall shortly 
[01:47:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:47:57] <logmsgbot>	 !log dzahn@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1169.eqiad.wmnet
[01:47:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:56] <wikibugs>	 (03CR) 10VolkerE: "Updated Wikidata logo with manually improved version, also reduced file sizes of all icons from 79 KB to 43 KB with help of TinyPNG." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319968 (https://phabricator.wikimedia.org/T147219) (owner: 10Catrope)
[01:49:36] <wikibugs>	 (03CR) 10VolkerE: "Foundation logo discussion is going on in other patch set, therefore no blocker from my POV any more on this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319968 (https://phabricator.wikimedia.org/T147219) (owner: 10Catrope)
[01:49:37] <mutante>	 !log carbon - temp stop DHCP service to test install from install1001
[01:49:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:28] <icinga-wm>	 PROBLEM - Disk space on relforge1001 is CRITICAL: DISK CRITICAL - free space: / 14087 MB (15% inode=97%)
[01:53:04] <wikibugs>	 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2891779 (10Pokefan95)
[01:53:10] <mutante>	 15% is a lot left to call it CRIT
[01:54:03] <mutante>	 have to focus on the install so ignoring relforge1001
[01:54:18] <wikibugs>	 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2882187 (10Pokefan95)
[02:02:22] <mutante>	 !log re-enabling DHCP and puppet
[02:02:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:02:29] <mutante>	 on carbon
[02:06:13] <mutante>	 !log reinstalling mw1169 (carbon DHCP, install1001 TFTP)
[02:06:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:45] <mutante>	 sup Samantha
[02:07:59] <wikibugs>	 (03PS2) 10Dzahn: install/dhcp: switch all "next-server" from carbon to install1001 [puppet] - 10https://gerrit.wikimedia.org/r/328439 (https://phabricator.wikimedia.org/T132757)
[02:08:39] <SamanthaNguyen>	 hello!
[02:09:28] <mutante>	 hi
[02:10:28] <mutante>	 so replacing the DHCP part of carbon didn't work yet
[02:10:35] <mutante>	 i need ACL changes on network gear 
[02:10:44] <mutante>	 or at least i'm pretty sure that's it
[02:11:12] <mutante>	 something to continue on tomorrow.. but still.. carbon much closer to retiring 
[02:11:35] <mutante>	 the TFTP part is done by install1001 just fine and they all have the same roles since today
[02:12:52] <mutante>	 !log mw1169 - delete salt key, revoke puppet cert
[02:12:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:20:57] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.6) (duration: 07m 40s)
[02:21:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:32] <mutante>	 !log mw1169 - reinstall done - sign new puppet cert, initial run...
[02:23:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:51:04] <mutante>	 !log relforge1001 has huge /var/log/elastichsearch/relforge-eqiad_feature.log that wrote GBs just today but then stopped
[02:51:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:51:46] <mutante>	 ebernhardson: ^ do you know that file ?
[02:52:00] <mutante>	 relforge-eqiad_feature.log
[02:57:18] <mutante>	 i see a "root" is watching that file with "pv" so somebody is aware.. good enough
[03:01:38] <icinga-wm>	 PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[03:02:07] <wikibugs>	 (03PS1) 10Dzahn: move ganglia aggregator eqiad from carbon to install1001 [puppet] - 10https://gerrit.wikimedia.org/r/328450 (https://phabricator.wikimedia.org/T132757)
[03:03:05] <logmsgbot>	 !log dzahn@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1169.eqiad.wmnet
[03:03:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:03:56] <mutante>	 afk
[03:05:25] <wikibugs>	 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2882187 (10Dzahn) mw1168 and mw1169 have been reinstalled with the new partman recipe and now have more space i...
[03:06:17] <wikibugs>	 (03PS1) 10Tim Landscheidt: Tools: Fully qualify hostnames [puppet] - 10https://gerrit.wikimedia.org/r/328451 (https://phabricator.wikimedia.org/T153608)
[03:16:59] <wikibugs>	 (03PS1) 10Tim Landscheidt: Tools: Remove redundant tools-db entry from /etc/hosts [puppet] - 10https://gerrit.wikimedia.org/r/328453 (https://phabricator.wikimedia.org/T139190)
[03:29:38] <icinga-wm>	 RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[03:31:16] <wikibugs>	 (03PS1) 10Tim Landscheidt: deployment-prep: Fully qualify hostnames [puppet] - 10https://gerrit.wikimedia.org/r/328455 (https://phabricator.wikimedia.org/T153608)
[03:34:18] <icinga-wm>	 PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[03:35:23] <wikibugs>	 (03PS1) 10Tim Landscheidt: staging: Fully qualify hostname [puppet] - 10https://gerrit.wikimedia.org/r/328456 (https://phabricator.wikimedia.org/T153608)
[03:38:10] <wikibugs>	 (03PS1) 10Tim Landscheidt: trebuchet: Fully qualify hostname [puppet] - 10https://gerrit.wikimedia.org/r/328457 (https://phabricator.wikimedia.org/T153608)
[04:03:28] <icinga-wm>	 RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[04:34:28] <icinga-wm>	 PROBLEM - puppet last run on labnodepool1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:36:36] <Niharika>	 !log commtech Added samwilson as project admin
[04:36:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:39:48] <icinga-wm>	 PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[04:48:45] <wikibugs>	 (03PS19) 10Yuvipanda: labs: maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157
[04:51:54] <wikibugs>	 (03CR) 10Yuvipanda: [C: 032] labs: maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 (owner: 10Yuvipanda)
[04:55:25] <wikibugs>	 (03PS1) 10Yuvipanda: labs: Fix service unit for maintani-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/328459
[04:56:18] <icinga-wm>	 PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:57:52] <wikibugs>	 (03CR) 10Yuvipanda: [C: 032] labs: Fix service unit for maintani-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/328459 (owner: 10Yuvipanda)
[04:59:15] <icinga-wm>	 RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational
[04:59:25] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active
[05:02:25] <icinga-wm>	 RECOVERY - puppet last run on labnodepool1001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[05:07:55] <icinga-wm>	 RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[05:11:13] <wikibugs>	 (03PS1) 10Yuvipanda: labsdbs: Fixup delete-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/328460
[05:11:42] <tto>	 https://meta.wikimedia.org/wiki/Special:GlobalUserRights gives me [WFoPAgpAMFMAAAzgMyEAAABC] 2016-12-21 05:11:30: Fatal exception of type MWException
[05:11:55] <wikibugs>	 (03CR) 10Yuvipanda: [V: 032 C: 032] labsdbs: Fixup delete-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/328460 (owner: 10Yuvipanda)
[05:12:15] <icinga-wm>	 PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:12:25] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
[05:13:15] <icinga-wm>	 PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:15:15] <icinga-wm>	 RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational
[05:15:25] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active
[05:29:39] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
[05:30:40] <yuvipanda>	 (am playing with ^ now)
[05:30:56] <yuvipanda>	 hmm I need to have it shut up on 1005
[05:39:19] <icinga-wm>	 RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational
[05:39:39] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active
[05:49:19] <icinga-wm>	 PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=4858.80 Read Requests/Sec=468.60 Write Requests/Sec=0.90 KBytes Read/Sec=32574.40 KBytes_Written/Sec=28.80
[05:57:29] <icinga-wm>	 PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=46%)
[06:01:19] <icinga-wm>	 RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=24.70 Read Requests/Sec=7.30 Write Requests/Sec=2.60 KBytes Read/Sec=29.60 KBytes_Written/Sec=76.40
[06:12:19] <wikibugs>	 06Operations, 10Annual-Report: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#2892108 (10Aklapper) >>! In T151798#2828771, @Dzahn wrote: > https://annual.wikimedia.org/2016/ > Does this resolve the ticket to create the URL? > Or did you want to keep it open until the real cont...
[06:26:12] <wikibugs>	 06Operations, 10Annual-Report: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#2892140 (10Dzahn) a:03ZMcCune @ZMcCune let me know if you have questions about uploading the content to gerrit. i can help with that if needed
[06:29:28] <wikibugs>	 06Operations, 10OCG-General, 06Wiktionary, 13Patch-For-Review: Download as PDF does not work in English Wiktionary: "There was an error while attempting to render your book." - https://phabricator.wikimedia.org/T150604#2892150 (10Aklapper) 05Open>03Resolved a:03Aklapper All patches merged, hence assu...
[06:30:01] <wikibugs>	 (03PS1) 10Dzahn: planet: remove wikimedia.org.au feed [puppet] - 10https://gerrit.wikimedia.org/r/328465 (https://phabricator.wikimedia.org/T133620)
[06:30:51] <wikibugs>	 06Operations, 10IDS-extension, 10Wikimedia-Extension-setup, 07I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2892160 (10Samwilson)
[06:36:52] <wikibugs>	 (03PS2) 10Dzahn: planet: remove wikimedia.org.au feed [puppet] - 10https://gerrit.wikimedia.org/r/328465 (https://phabricator.wikimedia.org/T133620)
[06:37:20] <wikibugs>	 (03PS3) 10Dzahn: planet: remove wikimedia.org.au feed [puppet] - 10https://gerrit.wikimedia.org/r/328465 (https://phabricator.wikimedia.org/T133620)
[06:38:44] <wikibugs>	 (03CR) 10Dzahn: [C: 032] planet: remove wikimedia.org.au feed [puppet] - 10https://gerrit.wikimedia.org/r/328465 (https://phabricator.wikimedia.org/T133620) (owner: 10Dzahn)
[06:43:37] <wikibugs>	 (03PS1) 10Tim Landscheidt: apache: Fix some issues with apache::static_site [puppet] - 10https://gerrit.wikimedia.org/r/328466 (https://phabricator.wikimedia.org/T153816)
[06:43:39] <wikibugs>	 (03PS1) 10Tim Landscheidt: [WIP] aptly: Make aptly work with Apache [puppet] - 10https://gerrit.wikimedia.org/r/328467 (https://phabricator.wikimedia.org/T153814)
[06:44:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] apache: Fix some issues with apache::static_site [puppet] - 10https://gerrit.wikimedia.org/r/328466 (https://phabricator.wikimedia.org/T153816) (owner: 10Tim Landscheidt)
[06:46:59] <wikibugs>	 (03PS2) 10Tim Landscheidt: apache: Fix some issues with apache::static_site [puppet] - 10https://gerrit.wikimedia.org/r/328466 (https://phabricator.wikimedia.org/T153816)
[06:47:01] <wikibugs>	 (03PS2) 10Tim Landscheidt: [WIP] aptly: Make aptly work with Apache [puppet] - 10https://gerrit.wikimedia.org/r/328467 (https://phabricator.wikimedia.org/T153814)
[06:47:57] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] apache: Fix some issues with apache::static_site [puppet] - 10https://gerrit.wikimedia.org/r/328466 (https://phabricator.wikimedia.org/T153816) (owner: 10Tim Landscheidt)
[06:50:29] <icinga-wm>	 RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[06:55:03] <wikibugs>	 (03CR) 10Tim Landscheidt: [C: 04-1] "Missed the possibilities of array *and* string for ldap_groups, and need to align arrows." [puppet] - 10https://gerrit.wikimedia.org/r/328466 (https://phabricator.wikimedia.org/T153816) (owner: 10Tim Landscheidt)
[06:56:29] <icinga-wm>	 PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[vim]
[07:18:39] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:18:39] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:18:39] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:18:39] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:18:39] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:18:40] <icinga-wm>	 PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:18:40] <icinga-wm>	 PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:18:41] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:18:41] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:18:42] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:18:42] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:18:42] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:18:49] <marostegui>	 ^ checking
[07:19:29] <icinga-wm>	 RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[07:19:29] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[07:19:29] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:19:29] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:19:29] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[07:19:30] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:19:30] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:19:31] <icinga-wm>	 RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:19:31] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:19:32] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[07:19:32] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:19:32] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[07:20:23] <marostegui>	 It took ages for me to run show slave 's1' status as the server was a bit loaded, so I guess that is why it alerted. There are several backups being done at the moment so the server is a bit loaded
[07:20:42] <marostegui>	 !log Running optimize table on db1045 for the revision tables as we urgently need some space back on that host - https://phabricator.wikimedia.org/T153739
[07:20:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:29] <icinga-wm>	 RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[07:46:56] <wikibugs>	 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2892280 (10Joe) >>! In T153797#2891580, @Pchelolo wrote: > Another interesting type of 404s from parsoid started to appear after move...
[07:56:31] <wikibugs>	 (03PS1) 10Elukey: Increase hhvm threads and transcode capabilities on mw116[89] [puppet] - 10https://gerrit.wikimedia.org/r/328473 (https://phabricator.wikimedia.org/T153488)
[07:59:25] <wikibugs>	 (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/4957/ looks good" [puppet] - 10https://gerrit.wikimedia.org/r/328473 (https://phabricator.wikimedia.org/T153488) (owner: 10Elukey)
[08:01:40] <marostegui>	 !log Running optimize table on db1044 for the pagelinks tables as we urgently need some space back on that host - T153826
[08:01:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:44] <stashbot>	 T153826: Defragment db1044 - https://phabricator.wikimedia.org/T153826
[08:26:54] <elukey>	 Revent: o/ - mw116[89] should handle a bit more load now, let's see if the queue improves 
[08:27:03] <Revent>	 Yay.
[08:27:27] <elukey>	 https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=mw1169
[08:28:42] <Revent>	 elukey: Something that was done, at some point, dumped a ton of files (thousands) out as having a “transcode_time_addjob”, a “transcode_time_error”, and no “transcode_time_startwork”….
[08:29:32] <Revent>	 The system appears to see them as queued, and runs them, but they end up with a ‘negative’ encoding time (they have a success time, an error time, and no start time)
[08:29:54] <elukey>	 Revent: do you mean during the past couple of days or way before?
[08:30:01] <Revent>	 Recently.
[08:30:33] <Revent>	 That same DB query had previously only shown files that I had ‘intentionally’ broken by resetting them while running.
[08:31:03] <elukey>	 maybe that was me restarting the jobrunners
[08:31:23] <Revent>	 I’ve been going through the list, and resetting them to make them have a consistent state in the DB…. it doesn’t ‘add them’ to the queue (they are already queued) but makes them end up with the correct status.
[08:31:43] <Revent>	 Yeah, that’s what I suspect, that they were ‘supposedly’ started by the bug, but not really.
[08:32:26] <Revent>	 Anyhow, I’m fixing them… I rather suspect that these are the files that are being ‘run’ but not showing up as running in the timedmediahandler special page.
[08:33:46] <elukey>	 so I am seeing Apache fcgi errors now, and 503s logged
[08:33:47] <elukey>	 sigh
[08:35:02] <Revent>	 It’s worth noting that the ‘thousands’ I’m referring to all failed at “Error on 12:49, 2016 December 19” or a minute later.
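The inconsistent state Revent describes (an add-job time and an error time, but no start-work time) can be sketched as a small classifier. This is only an illustration, not the TimedMediaHandler code or the DB query Revent ran: `TranscodeRow` and `classify` are hypothetical names, and `transcode_time_success` is an assumed field name for the "success time" mentioned in the chat.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class TranscodeRow:
    """Hypothetical in-memory view of one transcode record.

    The addjob/startwork/error field names are the ones quoted in the chat;
    transcode_time_success is an assumption for the 'success time'.
    """
    key: str
    transcode_time_addjob: Optional[datetime] = None
    transcode_time_startwork: Optional[datetime] = None
    transcode_time_success: Optional[datetime] = None
    transcode_time_error: Optional[datetime] = None


def classify(row: TranscodeRow) -> str:
    """Label a row according to the symptoms described in the conversation."""
    broken = (
        row.transcode_time_addjob is not None
        and row.transcode_time_error is not None
        and row.transcode_time_startwork is None
    )
    if not broken:
        return "other"
    if row.transcode_time_success is not None:
        # Ran without being reset first: both an error and a success time.
        return "ran-without-reset"
    # Still queued, but carrying a leftover error time.
    return "queued-with-stale-error"


if __name__ == "__main__":
    stale = TranscodeRow(
        key="example.webm",
        transcode_time_addjob=datetime(2016, 12, 19, 12, 0),
        transcode_time_error=datetime(2016, 12, 19, 12, 49),
    )
    print(classify(stale))
```

Rows labelled `queued-with-stale-error` correspond to the ones Revent resets before they run, so that they finish with a consistent status.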
[08:37:39] <wikibugs>	 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2892463 (10Joe) So, about the unconfigured domain ones:  it seems that parsoid sometimes sends out a request without the `Host:` heade...
[08:42:49] <elukey>	 !log restarted hhvm/jobrunner (and killed ffmpeg processes) on mw116[89] 
[08:42:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:09] <elukey>	 Revent: --^ this might trigger errors, but let's see if it helps.. 
[08:44:40] <wikibugs>	 07Puppet, 13Patch-For-Review: apache::static_site is not working - https://phabricator.wikimedia.org/T153816#2892485 (10Peachey88)
[08:45:23] <Revent>	 elukey: I’m vaguely guessing that errors won’t actually show up until the timeout expires...
[08:58:59] <elukey>	 ok so I am live hacking on mw1168 to figure out why hhvm truncates connections with httpd
[08:59:48] <wikibugs>	 (03PS1) 10Muehlenhoff: Also follow stat1001 rename in debdeploy grains [puppet] - 10https://gerrit.wikimedia.org/r/328478
[09:02:52] <wikibugs>	 (03PS5) 10Muehlenhoff: Make systemd-timesyncd available as an alternative time synchronisation provider (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/322279 (https://phabricator.wikimedia.org/T150257)
[09:04:55] <elukey>	 I found the issue, sigh
[09:07:52] <wikibugs>	 (03PS1) 10Elukey: Add hhvm timeouts overrides to hiera for mw116[89] [puppet] - 10https://gerrit.wikimedia.org/r/328479 (https://phabricator.wikimedia.org/T153488)
[09:10:32] <wikibugs>	 (03CR) 10Elukey: [C: 032] Add hhvm timeouts overrides to hiera for mw116[89] [puppet] - 10https://gerrit.wikimedia.org/r/328479 (https://phabricator.wikimedia.org/T153488) (owner: 10Elukey)
[09:11:15] <wikibugs>	 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2892522 (10Joe) After further inspection of the logs, it seems that those are the only 404 errors we get, apart from a few when trying...
[09:12:18] <Revent>	 elukey: Greatly appreciate you guys working on this… it obviously needed the love, lol.
[09:14:51] <elukey>	 Revent: same thing for your work :)
[09:14:59] <elukey>	 (I mean appreciated!)
[09:15:23] * elukey tails logs on mw116[89] waiting for good news
[09:15:36] <_joe_>	 elukey: what's the issue?
[09:15:46] <_joe_>	 oh I see
[09:15:47] <_joe_>	 sigh
[09:19:48] <elukey>	 _joe_ it was an issue between me and puppet, that thing always makes fun of me
[09:21:37] <elukey>	 on the bright side, load is still good so we might think of raising the runners a bit more later on
[09:21:44] <_joe_>	 I'
[09:22:02] <_joe_>	 I would suggest that everyone let the system run for a few days without intervention
[09:22:12] <_joe_>	 or we won't be able to understand what's happening
[09:22:23] <_joe_>	 and we need to ping the developers too, elukey 
[09:22:56] <elukey>	 _joe_ yes you are right :)
[09:27:48] <Revent>	 _joe_: Hey.
[09:28:15] <_joe_>	 Revent: as you might guess from the ticket comments, I'm working on something else today :/
[09:28:36] <Revent>	 Just to be clear… there are ‘lots’ of queued tasks that have a ‘messed up’ status… they are queued, but have an error time in the DB.
[09:29:16] <Revent>	 If run without being reset first, they end up with a ‘weird’ status (both an error time and a success time)
[09:29:46] <Revent>	 Hopefully, I can still work on resetting those (before they are run) so that when run they’ll end up with the right status.
[09:30:20] <Revent>	 I’m not resetting anything that’s actually ‘running’ anymore.
[09:30:29] <icinga-wm>	 PROBLEM - puppet last run on analytics1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:41:21] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] package_builder: rebuild Packages only when needed [puppet] - 10https://gerrit.wikimedia.org/r/328221 (owner: 10Filippo Giunchedi)
[09:41:26] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: package_builder: rebuild Packages only when needed [puppet] - 10https://gerrit.wikimedia.org/r/328221 (owner: 10Filippo Giunchedi)
[09:41:29] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] package_builder: rebuild Packages only when needed [puppet] - 10https://gerrit.wikimedia.org/r/328221 (owner: 10Filippo Giunchedi)
[09:46:33] <wikibugs>	 06Operations, 10media-storage, 13Patch-For-Review: cronspam cleanup: Cron <root@ms-be2*> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.monthly ) - https://phabricator.wikimedia.org/T152440#2892599 (10MoritzMuehlenhoff) A fix for this is pending with the next jessie point release, which...
[09:58:14] <wikibugs>	 (03PS1) 10Reedy: 3 more to extension.json in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328482 (https://phabricator.wikimedia.org/T139800)
[09:58:29] <icinga-wm>	 RECOVERY - puppet last run on analytics1026 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[09:58:33] <jynus>	 !log extending db1035 /srv partition
[09:58:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:18] <moritzm>	 !log installing libgme security updates
[10:01:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:04] <wikibugs>	 (03PS1) 10Reedy: Use wfLoadExtension for 3 more extensions too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328484 (https://phabricator.wikimedia.org/T140852)
[10:35:39] <wikibugs>	 (03PS2) 10Marostegui: misc.my.cnf.erb: Enable barracuda and innodb_strict_mode [puppet] - 10https://gerrit.wikimedia.org/r/321638 (https://phabricator.wikimedia.org/T150949)
[10:43:19] <jynus>	 !log dropping non-wiki databases from labsdb1001
[10:43:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:26] <wikibugs>	 06Operations, 10TimedMediaHandler, 10hardware-requests: Assign 3 more servers to video scaler duty - https://phabricator.wikimedia.org/T114337#1691895 (10elukey) In T153488 we repurposed two jobrunners to videoscalers (mw116[89]), so now the total eqiad cluster is 4. We spent a bit of time solving an apache<...
[10:54:39] <icinga-wm>	 PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:57:24] <wikibugs>	 (03PS2) 10Elukey: Addin Eric Evans to analytics-privatedata for stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/328181 (https://phabricator.wikimedia.org/T153375) (owner: 10Cmjohnson)
[10:58:58] <wikibugs>	 (03PS3) 10Elukey: Add Eric Evans to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/328181 (https://phabricator.wikimedia.org/T153375) (owner: 10Cmjohnson)
[11:00:11] <wikibugs>	 (03CR) 10Elukey: [C: 032] Add Eric Evans to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/328181 (https://phabricator.wikimedia.org/T153375) (owner: 10Cmjohnson)
[11:06:02] <wikibugs>	 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1002.eqiad.wmnet for eevans - https://phabricator.wikimedia.org/T153375#2892947 (10elukey) 05Open>03Resolved a:03elukey Ran puppet on stat100[24], @Eevans you can now ssh and your username is in the `analytics-privatedata...
[11:22:39] <icinga-wm>	 RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[11:37:39] <icinga-wm>	 PROBLEM - OTRS SMTP on mendelevium is CRITICAL: connect to address 10.64.32.174 and port 25: Connection refused
[11:41:49] <icinga-wm>	 PROBLEM - spamassassin on mendelevium is CRITICAL: PROCS CRITICAL: 0 processes with args spamd
[11:48:16] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: tlsproxy::localssl: add ability to have an access.log [puppet] - 10https://gerrit.wikimedia.org/r/328495 (https://phabricator.wikimedia.org/T153797)
[11:49:49] <icinga-wm>	 RECOVERY - spamassassin on mendelevium is OK: PROCS OK: 1 process with args spamd
[11:50:39] <icinga-wm>	 RECOVERY - OTRS SMTP on mendelevium is OK: SMTP OK - 0.003 sec. response time
[11:53:23] <wikibugs>	 06Operations, 10Parsoid, 13Patch-For-Review, 06Services (doing), 15User-mobrovac: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2893086 (10mobrovac) p:05Triage>03High
[11:59:37] <wikibugs>	 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2893089 (10Esc3300) I think it would be good to see some metrics. How many people use these links? Is this already available on grafana?
[12:13:57] <Revent>	 elukey: Still around?
[12:15:38] <Revent>	 Or _joe_ really, just want to comment.
[12:18:08] <wikibugs>	 (03PS1) 10Muehlenhoff: Add retroactively assigned CVE ID [debs/linux44] - 10https://gerrit.wikimedia.org/r/328498
[12:18:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Add retroactively assigned CVE ID [debs/linux44] - 10https://gerrit.wikimedia.org/r/328498 (owner: 10Muehlenhoff)
[12:19:16] <Revent>	 Meh, you’ll likely see it later. The Special page is showing about 90 transcodes running at a time, which is an appropriate number for the CPU count. There do not seem to be tasks showing up as ‘failed’, and they are completing.
[12:21:16] <Revent>	 I’m methodically ‘resetting’ transcodes shown on the ‘special page’ as queued, but that show on the file page as ‘error’… these all have a couple of specific times shown for the error that seem to directly connect to server resets… resetting them is just so they end up with a ‘sane’ status when run, instead of negative times or other oddness.
[12:22:11] <Revent>	 There are still a ton of the ‘messed up’ ones, but the rate of them completing with the ‘broken’ status (both a success and an error time) seems to be going down.
[12:23:24] <Revent>	 (resetting these does not make the count on the special page of ‘queued’ OR ‘failed’ transcodes change… they just have a broken state in the DB)
[12:23:30] * elukey reads
[12:25:15] <elukey>	 Revent: is there a phab task to track this behavior? If not, it would be really great if you could (whenever you have time) open one and CC me
[12:25:22] <Revent>	 I think the issue with ‘running’ transcodes that are not shown as ‘running’ on the special page is just due to ones with that ‘broken’ status.
[12:25:39] <Revent>	 Not specifically for this, I don’t think.
[12:26:20] <elukey>	 no, I meant for the ones that you are resetting..
[12:26:33] <Revent>	 Ah, nope.
[12:26:37] <elukey>	 you shouldn't have to do it, it feels wrong :D
[12:26:46] <Revent>	 Yeah, I know...
[12:26:59] <elukey>	 maybe it is a bug that we can solve easily
[12:27:06] <elukey>	 we == ops and devs
[12:27:18] <Revent>	 To make it clear, I had written this query...
[12:27:19] <elukey>	 or anybody that has experience with the php code
[12:27:31] <elukey>	 sure sure :)
[12:27:46] <elukey>	 this is the current status now from terbium
[12:27:47] <elukey>	 webVideoTranscode: 15434 queued; 805 claimed (199 active, 606 abandoned); 0 delayed
[12:27:59] <Revent>	 https://quarry.wmflabs.org/query/14861 <- the description is related to what I was originally using it for, it showed the ones that I had kicked off the queue by resetting them while running.
[12:28:26] <Revent>	 It was around a hundred or two.
[12:28:48] <Revent>	 When you reset the servers, after fixing the timeout, it dumped about 5k on there.
[12:29:07] <elukey>	 :(
[12:29:21] <elukey>	 we definitely need a phab task to see if it is a bug or a feature :D
[12:29:42] <elukey>	 I am going afk for a bit, but feel free to write / ping me, I'll read later on!
[12:29:52] <Revent>	 I think what needs to be fixed is that when a transcode completes successfully, it explicitly clears any previously existing transcode_error and transcode_time_error
[12:30:39] <icinga-wm>	 PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:30:53] <Revent>	 But… I don’t think it’s something that would occur when the system was running normally.
[12:31:42] <Revent>	 (it’s not something I ever saw before my ‘reset while running’, or your server reset)
[12:33:19] <Revent>	 The error times also seem to be tightly clustered around 12:40 and 13:30 on the 19th, which I believe is when you reset the two servers.
[12:34:10] <Revent>	 If the system was, due to the insane timeout, trying to start thousands of tasks that never actually ‘started’ because the servers were out of ram, this would make sense.
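Editor's sketch of the fix Revent proposes above: when a transcode completes successfully, clear any stale error fields so a row never carries both an error time and a success time. This is a minimal illustration, not the TimedMediaHandler code. The `TranscodeRow` class and both function names are hypothetical, and the fields only mirror the `transcode_error` and `transcode_time_error` columns named in the discussion, plus an assumed success-time column.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TranscodeRow:
    """Hypothetical in-memory stand-in for one row of the transcode table."""
    key: str
    time_success: Optional[str] = None  # assumed success-time column
    time_error: Optional[str] = None    # mirrors transcode_time_error
    error: str = ""                     # mirrors transcode_error


def is_inconsistent(row: TranscodeRow) -> bool:
    """True for the 'broken' state described above: both an error and a success time."""
    return row.time_success is not None and row.time_error is not None


def mark_success(row: TranscodeRow, finished_at: str) -> TranscodeRow:
    """Record a successful run and explicitly clear any earlier error state."""
    row.time_success = finished_at
    row.time_error = None
    row.error = ""
    return row


# A queued task that picked up an error time during a server reset.
row = TranscodeRow(key="Example.ogv/360p.webm",
                   time_error="20161219124000", error="timeout")

# Only setting the success time reproduces the broken status.
naive = TranscodeRow(row.key, "20161221120000", row.time_error, row.error)
print(is_inconsistent(naive))

# Clearing the error fields on success leaves a sane status.
fixed = mark_success(row, "20161221120000")
print(is_inconsistent(fixed))
```

Under these assumptions, the manual resets described earlier would no longer be needed for rows that eventually complete, since a successful run would overwrite the leftover error state.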
[12:40:39] <wikibugs>	 06Operations, 06Commons, 10TimedMediaHandler, 10Wikimedia-Video: Creation of derivative video files is not working - https://phabricator.wikimedia.org/T153852#2893187 (10Pokefan95)
[12:42:09] <wikibugs>	 (03PS1) 10Urbanecm: Enable SandboxLink on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328502 (https://phabricator.wikimedia.org/T153855)
[12:44:50] <wikibugs>	 (03PS1) 10Muehlenhoff: Another retroactive CVE assignment [debs/linux44] - 10https://gerrit.wikimedia.org/r/328503
[12:45:23] <wikibugs>	 06Operations, 06Commons, 10TimedMediaHandler, 10Wikimedia-Video: Creation of derivative video files is not working - https://phabricator.wikimedia.org/T153852#2893074 (10Revent) This is known, and being worked. It's simply that the backlog became extremely large due to a high number of 'huge' (1920P, and a...
[12:46:52] <moritzm>	 !log install openjdk-6 security update on labsdb1006
[12:46:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:48:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Another retroactive CVE assignment [debs/linux44] - 10https://gerrit.wikimedia.org/r/328503 (owner: 10Muehlenhoff)
[12:56:14] <wikibugs>	 06Operations, 06Commons, 10TimedMediaHandler, 10Wikimedia-Video: Creation of derivative video files is not working - https://phabricator.wikimedia.org/T153852#2893223 (10Pristurus) Okay, thank you very much for this information.
[12:57:28] <logmsgbot>	 !log mobrovac@tin Starting deploy [parsoid/deploy@dab1f27]: Bug fix for mwApiServer T153797
[12:57:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:33] <stashbot>	 T153797: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797
[12:58:05] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix CVE ID for exception table privilege escalation [debs/linux44] - 10https://gerrit.wikimedia.org/r/328507
[12:59:39] <icinga-wm>	 RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[13:03:01] <logmsgbot>	 !log mobrovac@tin Finished deploy [parsoid/deploy@dab1f27]: Bug fix for mwApiServer T153797 (duration: 05m 32s)
[13:03:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:04] <stashbot>	 T153797: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797
[13:07:13] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Introduce dbmonitor1001, dbmonitor2001 [puppet] - 10https://gerrit.wikimedia.org/r/328509 (https://phabricator.wikimedia.org/T149557)
[13:08:08] <wikibugs>	 (03PS1) 10Muehlenhoff: Update to 4.4.33 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328510
[13:09:33] <moritzm>	 !log install hdf5 security updates
[13:09:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Fix CVE ID for exception table privilege escalation [debs/linux44] - 10https://gerrit.wikimedia.org/r/328507 (owner: 10Muehlenhoff)
[13:22:09] <wikibugs>	 06Operations, 10Parsoid, 13Patch-For-Review, 06Services (doing), 15User-mobrovac: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2893280 (10Joe) Errors have completely stopped after @mobrovac's patch has been added.
[13:32:29] <icinga-wm>	 PROBLEM - Disk space on db1035 is CRITICAL: DISK CRITICAL - free space: /srv 63672 MB (3% inode=99%)
[13:34:29] <jynus>	 that is me
[13:34:38] <jynus>	 it should reach 1%, then go down
[13:36:27] <jynus>	 it is depooled in any case
[13:36:30] <wikibugs>	 06Operations, 10Ops-Access-Requests, 10Reading-Web-Trending-Service, 06Services (done), 15User-mobrovac: Allow @Jdlrobson and @bearND to deploy and manage the trending edits service - https://phabricator.wikimedia.org/T153458#2893302 (10mobrovac)
[13:39:49] <icinga-wm>	 PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:40:54] <wikibugs>	 06Operations, 10Parsoid, 06Services (done), 15User-mobrovac: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2893307 (10mobrovac) 05Open>03Resolved No //batch request failure// errors for the last 40 minutes, so I...
[13:46:57] <wikibugs>	 (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/4959/" [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) (owner: 10Marostegui)
[13:47:53] <wikibugs>	 (03PS3) 10Ema: varnish cachestats.py: add support for defaults and key_prefix [puppet] - 10https://gerrit.wikimedia.org/r/328368 (https://phabricator.wikimedia.org/T151643)
[13:48:05] <wikibugs>	 (03CR) 10Ema: [V: 032 C: 032] varnish cachestats.py: add support for defaults and key_prefix [puppet] - 10https://gerrit.wikimedia.org/r/328368 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema)
[13:49:29] <icinga-wm>	 PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:50:49] <icinga-wm>	 RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[13:54:17] <wikibugs>	 (03PS2) 10Muehlenhoff: Update to 4.4.33 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328510
[13:55:20] <wikibugs>	 (03CR) 10Jcrespo: [WIP] Reporting tests with the private data script (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) (owner: 10Marostegui)
[13:57:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.33 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328510 (owner: 10Muehlenhoff)
[13:57:43] <wikibugs>	 (03PS2) 10Muehlenhoff: Also follow stat1001 rename in debdeploy grains [puppet] - 10https://gerrit.wikimedia.org/r/328478
[14:03:08] <icinga-wm>	 ACKNOWLEDGEMENT - Disk space on relforge1001 is CRITICAL: DISK CRITICAL - free space: / 7699 MB (8% inode=97%): Gehel disk space is used by relforge-eqiad_feature.log, related to current investigation by Erik. Log is not growing at this time, waiting for Erik to have a look
[14:03:25] <gehel>	 ebernhardson: for when you are back ^
[14:07:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Also follow stat1001 rename in debdeploy grains [puppet] - 10https://gerrit.wikimedia.org/r/328478 (owner: 10Muehlenhoff)
[14:08:14] <wikibugs>	 (03PS2) 10Ema: varnishxcache: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/328179 (https://phabricator.wikimedia.org/T151643)
[14:08:59] <icinga-wm>	 PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 2 minutes ago with 7 failures. Failed resources (up to 3 shown): Package[tzdata],Service[zotero],Exec[zotero-admin_ensure_members],Exec[sc-admins_ensure_members]
[14:14:23] <wikibugs>	 06Operations, 06Commons, 10TimedMediaHandler, 10Wikimedia-Video: Creation of derivative video files is not working - https://phabricator.wikimedia.org/T153852#2893416 (10zhuyifei1999) 05Open>03Invalid Duplicate of {T153488} and {T153747}
[14:18:29] <icinga-wm>	 RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[14:28:05] <wikibugs>	 06Operations, 10Wikimedia-Stream: Upstream prematurely closed connection - https://phabricator.wikimedia.org/T153772#2893423 (10ema) Note that this issue has been present for a very long time and is unrelated to the package upgrade performed yesterday and mentioned in T153773.  The earliest occurrence of the m...
[14:28:41] <wikibugs>	 (03PS1) 10Elukey: Remove MongoDB dependency from statistics cruncher [puppet] - 10https://gerrit.wikimedia.org/r/328519
[14:28:49] <icinga-wm>	 PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:29:22] <wikibugs>	 06Operations, 10Analytics, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2893426 (10Jan_Dittrich) Is that testing framework also planned to work with central notice/banners, or is that a separate infrastructure?
[14:29:51] <moritzm>	 !log installing imagemagick security updates
[14:29:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:59] <icinga-wm>	 RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[14:40:41] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: changeprop: use codfw's restbase in codfw as well [puppet] - 10https://gerrit.wikimedia.org/r/328521
[14:40:47] <_joe_>	 mobrovac: ^^
[14:46:51] <wikibugs>	 (03CR) 10Mobrovac: "Should we rather just declare this once in role/common/scb.yaml since the value will always be the same for both DCs?" [puppet] - 10https://gerrit.wikimedia.org/r/328521 (owner: 10Giuseppe Lavagetto)
[14:47:03] <mobrovac>	 _joe_: ^^ :)
[14:47:49] <icinga-wm>	 PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[imagemagick],Package[httpry]
[14:47:49] <icinga-wm>	 PROBLEM - puppet last run on mw2176 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[imagemagick]
[14:53:49] <icinga-wm>	 RECOVERY - puppet last run on mw2127 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[14:54:06] <moritzm>	 !log installing ghostscript security updates on trusty hosts
[14:54:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:49] <icinga-wm>	 RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[14:59:45] <wikibugs>	 (03CR) 10Volans: "A nitpicking node inline ;)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328179 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema)
[15:00:04] <wikibugs>	 (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/4960/" [puppet] - 10https://gerrit.wikimedia.org/r/328519 (owner: 10Elukey)
[15:02:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Add snapshot::testbed to standard snapshot debdeploy group [puppet] - 10https://gerrit.wikimedia.org/r/328523
[15:04:15] <elukey>	 !log removed mongodb* packages from stat1003 after https://gerrit.wikimedia.org/r/328519
[15:04:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:29] <wikibugs>	 (03CR) 10Elukey: [V: 032 C: 032] Add another test to run Varnishkafka with Valgrind [software/varnish/varnishkafka/testing] - 10https://gerrit.wikimedia.org/r/328381 (https://phabricator.wikimedia.org/T147438) (owner: 10Elukey)
[15:05:39] <icinga-wm>	 PROBLEM - DPKG on snapshot1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:06:59] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[15:07:59] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3987193 keys, up 51 days 6 hours - replication_delay is 41
[15:08:39] <icinga-wm>	 RECOVERY - DPKG on snapshot1001 is OK: All packages OK
[15:14:42] <wikibugs>	 (03PS1) 10Elukey: Increase Redis Replica Sync Nagios retry intreval to 2 minutes [puppet] - 10https://gerrit.wikimedia.org/r/328525
[15:15:16] <wikibugs>	 (03PS2) 10Elukey: Increase Redis Replica Sync Nagios retry intreval to 2 minutes [puppet] - 10https://gerrit.wikimedia.org/r/328525
[15:15:39] <elukey>	 _joe_ --^
[15:15:50] <icinga-wm>	 RECOVERY - puppet last run on mw2176 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[15:16:19] <wikibugs>	 (03PS3) 10Elukey: Increase Redis Replica Sync retry interval to 2 minutes [puppet] - 10https://gerrit.wikimedia.org/r/328525
[15:20:38] <_joe_>	 heh
[15:20:48] <_joe_>	 yes, I'll take a look elukey 
[15:22:52] <gehel>	 !log truncating /var/log/elasticsearch/relforge-eqiad_feature.log on relforge100[12]
[15:22:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:29] <icinga-wm>	 RECOVERY - Disk space on relforge1001 is OK: DISK OK
[15:29:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Add snapshot::testbed to standard snapshot debdeploy group [puppet] - 10https://gerrit.wikimedia.org/r/328523 (owner: 10Muehlenhoff)
[15:31:29] <icinga-wm>	 RECOVERY - Disk space on db1035 is OK: DISK OK
[15:32:39] <icinga-wm>	 PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:34:39] <icinga-wm>	 RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[15:40:42] <wikibugs>	 (03PS16) 10Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212
[15:40:44] <wikibugs>	 (03PS15) 10Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/324213
[15:40:46] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: kubernetes::master: Introduce the kubernetes profile [puppet] - 10https://gerrit.wikimedia.org/r/328174
[15:40:48] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: Create and assign the kubernetes::master role [puppet] - 10https://gerrit.wikimedia.org/r/328175
[15:47:25] <wikibugs>	 (03PS1) 10Muehlenhoff: Update to 4.4.34 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328529
[15:49:48] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: mediawiki::scaler: check orphaned HHVM threads (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/328194 (https://phabricator.wikimedia.org/T153488) (owner: 10Giuseppe Lavagetto)
[15:49:57] <wikibugs>	 (03PS1) 10Andrew Bogott: Wikitech:  include openstack::clientlib on silver [puppet] - 10https://gerrit.wikimedia.org/r/328530 (https://phabricator.wikimedia.org/T153807)
[15:50:09] <icinga-wm>	 PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[15:50:09] <icinga-wm>	 PROBLEM - check_payments_wiki on payments2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[15:52:03] <Jeff_Green>	 ^ fixing
[15:53:41] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] Wikitech:  include openstack::clientlib on silver [puppet] - 10https://gerrit.wikimedia.org/r/328530 (https://phabricator.wikimedia.org/T153807) (owner: 10Andrew Bogott)
[15:54:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.34 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328529 (owner: 10Muehlenhoff)
[15:55:09] <icinga-wm>	 PROBLEM - check_payments_wiki on payments2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[15:55:09] <icinga-wm>	 PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[15:57:15] <wikibugs>	 (03PS1) 10Muehlenhoff: Update to 4.4.35 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328531
[15:57:39] <icinga-wm>	 PROBLEM - puppet last run on maps1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:58:57] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: changeprop: use codfw's restbase in codfw as well [puppet] - 10https://gerrit.wikimedia.org/r/328521
[15:59:27] <wikibugs>	 (03PS2) 10Milimetric: Upgrade edit_history to mediawiki_history [puppet] - 10https://gerrit.wikimedia.org/r/325572
[15:59:57] <_joe_>	 mobrovac: merging ^^
[16:00:09] <icinga-wm>	 PROBLEM - check_payments_wiki on payments2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[16:00:09] <icinga-wm>	 PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[16:00:52] <mobrovac>	 kk _joe_, +1 from me
[16:01:03] <wikibugs>	 (03CR) 10Mobrovac: [C: 031] changeprop: use codfw's restbase in codfw as well [puppet] - 10https://gerrit.wikimedia.org/r/328521 (owner: 10Giuseppe Lavagetto)
[16:01:17] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: changeprop: use codfw's restbase in codfw as well [puppet] - 10https://gerrit.wikimedia.org/r/328521
[16:01:28] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] changeprop: use codfw's restbase in codfw as well [puppet] - 10https://gerrit.wikimedia.org/r/328521 (owner: 10Giuseppe Lavagetto)
[16:04:19] <icinga-wm>	 RECOVERY - check_payments_wiki on payments2002 is OK: HTTP OK: Status line output matched HTTP/1.1 503 - 214 bytes in 0.013 second response time
[16:04:29] <icinga-wm>	 RECOVERY - check_payments_wiki on payments2003 is OK: HTTP OK: Status line output matched HTTP/1.1 503 - 214 bytes in 0.013 second response time
[16:05:09] <icinga-wm>	 PROBLEM - check_listener_gc on saiph is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 404 Not Found
[16:05:09] <icinga-wm>	 PROBLEM - check_listener_ipn on saiph is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 404 Not Found
[16:05:59] <wikibugs>	 (03PS3) 10Gehel: New upstream version: 1.11.0 [debs/logstash-gelf] - 10https://gerrit.wikimedia.org/r/320992 (https://phabricator.wikimedia.org/T150408)
[16:06:19] <icinga-wm>	 RECOVERY - check_listener_gc on saiph is OK: HTTP OK: Status line output matched HTTP/1.1 503 - 214 bytes in 0.014 second response time
[16:06:19] <icinga-wm>	 RECOVERY - check_listener_ipn on saiph is OK: HTTP OK: Status line output matched HTTP/1.1 503 - 214 bytes in 0.012 second response time
[16:06:30] <wikibugs>	 06Operations, 06Labs, 13Patch-For-Review: openstackclient/keystoneclient on silver broken - https://phabricator.wikimedia.org/T153807#2893685 (10Andrew) 05Open>03Resolved a:03Andrew Attached patch plus a bit of fussing with packages (python-oslo-serialization vs python-oslo.serialization) and this is f...
[16:11:57] <wikibugs>	 (03PS3) 10Milimetric: Upgrade edit_history to mediawiki_history [puppet] - 10https://gerrit.wikimedia.org/r/325572
[16:12:29] <icinga-wm>	 PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:13:39] <icinga-wm>	 PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:16:59] <wikibugs>	 (03CR) 10Joal: [C: 031] "LGTM !" [puppet] - 10https://gerrit.wikimedia.org/r/325572 (owner: 10Milimetric)
[16:17:58] <wikibugs>	 (03PS4) 10Elukey: Upgrade edit_history to mediawiki_history [puppet] - 10https://gerrit.wikimedia.org/r/325572 (owner: 10Milimetric)
[16:18:26] <wikibugs>	 (03PS1) 10Yuvipanda: labsdb: Add delete user functionality to maintain-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/328533
[16:22:58] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki::scaler: check orphaned HHVM threads [puppet] - 10https://gerrit.wikimedia.org/r/328194 (https://phabricator.wikimedia.org/T153488)
[16:24:04] <wikibugs>	 06Operations, 06Analytics-Kanban, 06Reading-Web-Backlog, 10Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2893736 (10mforns)
[16:25:39] <icinga-wm>	 RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[16:28:18] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 031] "Although those alarms are a symptom of a real problem, I agree." [puppet] - 10https://gerrit.wikimedia.org/r/328525 (owner: 10Elukey)
[16:29:04] <wikibugs>	 (03PS4) 10Elukey: Increase Redis Replica Sync retry interval to 2 minutes [puppet] - 10https://gerrit.wikimedia.org/r/328525
[16:31:15] <wikibugs>	 (03CR) 10Elukey: [C: 032] Increase Redis Replica Sync retry interval to 2 minutes [puppet] - 10https://gerrit.wikimedia.org/r/328525 (owner: 10Elukey)
[16:31:51] <wikibugs>	 (03PS5) 10Milimetric: Upgrade edit-history to mediawiki-history-beta [puppet] - 10https://gerrit.wikimedia.org/r/325572
[16:33:34] <wikibugs>	 (03PS6) 10Elukey: Upgrade edit-history to mediawiki-history-beta [puppet] - 10https://gerrit.wikimedia.org/r/325572 (owner: 10Milimetric)
[16:35:02] <wikibugs>	 (03CR) 10Elukey: [C: 032] Upgrade edit-history to mediawiki-history-beta [puppet] - 10https://gerrit.wikimedia.org/r/325572 (owner: 10Milimetric)
[16:38:13] <urandom>	 elukey, nuria: i'm all set on stat1002; thank you!
[16:38:35] <elukey>	 \o/
[16:39:19] <icinga-wm>	 PROBLEM - pivot on thorium is CRITICAL: connect to address 10.64.53.26 and port 9090: Connection refused
[16:40:04] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: mediawiki::scaler: check orphaned HHVM threads [puppet] - 10https://gerrit.wikimedia.org/r/328194 (https://phabricator.wikimedia.org/T153488)
[16:40:10] <elukey>	 pivot is me
[16:40:18] <_joe_>	 elukey: merging this ^^ FYI
[16:40:19] <icinga-wm>	 RECOVERY - pivot on thorium is OK: TCP OK - 0.000 second response time on 10.64.53.26 port 9090
[16:40:29] <elukey>	 _joe_ ack!
[16:40:47] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] mediawiki::scaler: check orphaned HHVM threads [puppet] - 10https://gerrit.wikimedia.org/r/328194 (https://phabricator.wikimedia.org/T153488) (owner: 10Giuseppe Lavagetto)
[16:41:42] <wikibugs>	 (03PS1) 10Milimetric: Fix typo in pivot config yaml [puppet] - 10https://gerrit.wikimedia.org/r/328537
[16:42:29] <icinga-wm>	 RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[16:42:49] <icinga-wm>	 PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:50:42] <wikibugs>	 (03PS2) 10Milimetric: Fix typo in pivot config yaml [puppet] - 10https://gerrit.wikimedia.org/r/328537
[16:51:28] <wikibugs>	 06Operations, 10Ops-Access-Requests, 10Reading-Web-Trending-Service, 06Services (done), 15User-mobrovac: Allow @Jdlrobson and @bearND to deploy and manage the trending edits service - https://phabricator.wikimedia.org/T153458#2893783 (10bearND) Thank you. I'll try it out early next year when we can do de...
[16:54:19] <wikibugs>	 (03PS3) 10Elukey: Fix typo in pivot config yaml [puppet] - 10https://gerrit.wikimedia.org/r/328537 (owner: 10Milimetric)
[16:54:34] <wikibugs>	 06Operations, 06Performance-Team, 06Reading-Web-Backlog, 10Traffic, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#2893795 (10dr0ptp4kt)
[16:55:19] <wikibugs>	 (03CR) 10Elukey: [C: 032] Fix typo in pivot config yaml [puppet] - 10https://gerrit.wikimedia.org/r/328537 (owner: 10Milimetric)
[16:56:14] <mutante>	 biggest mediawikis we know of, by number of "good" pages:  1) wikidata   2) http://lietuvai.lt/wiki/  3) https://www.wikiocity.com   4) en.wp
[17:00:08] <wikibugs>	 06Operations, 06Performance-Team, 06Reading-Infrastructure-Team, 06Reading-Web-Backlog, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#2893805 (10dr0ptp4kt) @Tgr @Anomie: during our Q3 FY 2016-2017 interlock with Performance and Technical Operations @Gille...
[17:02:55] <wikibugs>	 06Operations, 06Performance-Team, 06Reading-Web-Backlog, 10Traffic, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#2893816 (10elukey) >>! In T70861#2893792, @dr0ptp4kt wrote: > You should also ensure @BBlack and @elukey from #traffic are in t...
[17:04:38] <wikibugs>	 06Operations, 06Performance-Team, 06Reading-Web-Backlog, 10Traffic, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#2893819 (10dr0ptp4kt) Sorry @elukey, hi @ema.
[17:05:35] <dr0ptp4kt>	 elukey: sorry about that! i fixed the other ticket to say ema as well :)
[17:07:17] <wikibugs>	 (03PS1) 10Joal: Update pivot conf for mediawiki history beta [puppet] - 10https://gerrit.wikimedia.org/r/328540
[17:07:23] <joal>	 elukey: --^ if you have time
[17:08:11] <elukey>	 dr0ptp4kt: no worries! I was chatting with Ema today about similar things that happened recently, our usernames are close enough to be swapped sometimes :D
[17:09:47] <dr0ptp4kt>	 :)
[17:11:05] <icinga-wm>	 RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[17:11:32] <wikibugs>	 06Operations, 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure, 07Zuul: Zuul has started failing on some repo's in gerrit.wikimedia.org - https://phabricator.wikimedia.org/T153877#2893837 (10hashar) @Andrew pbr 1.10.0 is broken it fails to recognize a version such as the Zuul one `2...
[17:13:42] <wikibugs>	 06Operations, 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure, 07Zuul: Zuul has started failing on some repo's in gerrit.wikimedia.org - https://phabricator.wikimedia.org/T153877#2893846 (10Paladox) @hashar but it works on another test instance for me.  gerrit-test.  root@gerrit-te...
[17:15:31] <elukey>	 joal: would you mind also removing the old #description comment? I forgot to ask for it in the previous code review..
[17:15:39] <joal>	 elukey: doing !
[17:16:07] <elukey>	 (that would be the super long comment)
[17:16:24] <elukey>	 joal: mmm not sure if we need it or not
[17:16:26] <elukey>	 maybe I am mistaken
[17:16:31] <elukey>	 I'll let you decide!
[17:16:51] <joal>	 elukey: I don't mind having it nor deleting it :)
[17:17:06] <elukey>	 all right let's leave it for now
[17:17:15] <joal>	 ok, anything else?
[17:17:17] <wikibugs>	 06Operations, 10DBA, 06Labs, 07Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#2893853 (10jcrespo)
[17:17:21] <wikibugs>	 (03CR) 10Elukey: [C: 032] Update pivot conf for mediawiki history beta [puppet] - 10https://gerrit.wikimedia.org/r/328540 (owner: 10Joal)
[17:17:26] <joal>	 Yay, thanks elukey :)
[17:30:59] <wikibugs>	 06Operations, 10DBA, 06Labs, 07Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#2893895 (10jcrespo)
[17:32:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.35 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328531 (owner: 10Muehlenhoff)
[17:41:24] <wikibugs>	 (03PS1) 10Joal: Correct pivot conf for mediawiki history beta [puppet] - 10https://gerrit.wikimedia.org/r/328545
[17:41:31] <joal>	 elukey: as promised --^
[17:41:32] <joal>	 :)
[17:42:45] <wikibugs>	 (03PS1) 10Muehlenhoff: Update to 4.4.36 [debs/linux44] - 10https://gerrit.wikimedia.org/r/328546
[17:47:38] <wikibugs>	 (03PS1) 10Dzahn: add europium.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/328547
[17:48:48] <wikibugs>	 (03PS2) 10Dzahn: add europium.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/328547
[17:49:20] <wikibugs>	 (03CR) 10Elukey: [C: 032] Correct pivot conf for mediawiki history beta [puppet] - 10https://gerrit.wikimedia.org/r/328545 (owner: 10Joal)
[17:49:35] <wikibugs>	 (03CR) 10Dzahn: [C: 032] add europium.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/328547 (owner: 10Dzahn)
[17:52:01] <elukey>	 joal: done!
[17:52:05] <wikibugs>	 (03PS1) 10Dzahn: Revert "remove dhcp entry for old hostname europium" [puppet] - 10https://gerrit.wikimedia.org/r/328548
[17:52:28] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "remove dhcp entry for old hostname europium" [puppet] - 10https://gerrit.wikimedia.org/r/328548 (owner: 10Dzahn)
[17:53:07] <wikibugs>	 (03CR) 10Dzahn: "i found this orphaned host and it still had the old OS on it. i reopened T82239 and just re-adding this because i want to test installserv" [puppet] - 10https://gerrit.wikimedia.org/r/328548 (owner: 10Dzahn)
[17:53:43] <icinga-wm>	 PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:54:15] <mutante>	 well, turns out maybe hitting revert on a change from 2013 doesn't always work that well
[17:58:30] <_joe_>	 lol
[18:02:57] <wikibugs>	 (03PS1) 10Dzahn: dhcp: re-add europium but in eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/328550
[18:05:13] <icinga-wm>	 PROBLEM - check_payments_wiki on payments2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[18:07:37] <Jeff_Green>	 grrr. payments1001 ^^ I see it
[18:08:35] <wikibugs>	 (03PS2) 10Dzahn: dhcp: re-add europium but in eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/328550
[18:09:31] <wikibugs>	 (03CR) 10Dzahn: [C: 032] dhcp: re-add europium but in eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/328550 (owner: 10Dzahn)
[18:10:10] <wikibugs>	 (03Abandoned) 10Dzahn: Revert "remove dhcp entry for old hostname europium" [puppet] - 10https://gerrit.wikimedia.org/r/328548 (owner: 10Dzahn)
[18:13:08] <mutante>	 !log carbon - temp stopping dhcp server 
[18:13:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:12] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: kubernetes::master: Introduce the kubernetes profile [puppet] - 10https://gerrit.wikimedia.org/r/328174
[18:14:14] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: Create and assign the kubernetes::master role [puppet] - 10https://gerrit.wikimedia.org/r/328175
[18:14:16] <wikibugs>	 (03PS17) 10Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212
[18:14:18] <wikibugs>	 (03PS16) 10Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/324213
[18:14:20] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: k8s::apiserver: Allow specifying the SSL file paths [puppet] - 10https://gerrit.wikimedia.org/r/328553
[18:16:33] <icinga-wm>	 PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:22:43] <icinga-wm>	 RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[18:27:01] <wikibugs>	 (03PS4) 10Marostegui: [WIP] Reporting tests with the private data script [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680)
[18:29:37] <wikibugs>	 (03CR) 10Marostegui: [WIP] Reporting tests with the private data script (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) (owner: 10Marostegui)
[18:31:37] <wikibugs>	 06Operations, 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure, 07Zuul: Zuul has started failing on some repo's in gerrit.wikimedia.org - https://phabricator.wikimedia.org/T153877#2894082 (10hashar) We had version 0.8.2  The CI instance have an unattended-upgrade for repositories *-...
[18:33:23] <icinga-wm>	 RECOVERY - check_payments_wiki on payments2001 is OK: HTTP OK: HTTP/1.1 200 OK - 258 bytes in 0.013 second response time
[18:34:24] <wikibugs>	 (03PS1) 10Andrew Bogott: Try to pin python-pbr [puppet] - 10https://gerrit.wikimedia.org/r/328555
[18:44:33] <icinga-wm>	 RECOVERY - puppet last run on elastic1018 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[18:46:45] <wikibugs>	 (03PS3) 10Dzahn: install/dhcp: switch all "next-server" from carbon to install1001 [puppet] - 10https://gerrit.wikimedia.org/r/328439 (https://phabricator.wikimedia.org/T132757)
[18:47:11] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: fix syntax for mysql recording rules [puppet] - 10https://gerrit.wikimedia.org/r/328557
[18:47:20] <wikibugs>	 (03CR) 10Dzahn: "the only thing i could imagine is a problem here is that labs subnets are somehow special / need other config" [puppet] - 10https://gerrit.wikimedia.org/r/328439 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn)
[18:50:37] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] install/dhcp: switch all "next-server" from carbon to install1001 [puppet] - 10https://gerrit.wikimedia.org/r/328439 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn)
[18:59:26] <wikibugs>	 06Operations, 10Annual-Report: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#2894164 (10ZMcCune) @Heather: Any reason not to use https://annual.wikimedia.org/2016/?   @Dzahn: Thank you! We will let you know. Hope to have the static pages ready in early January.
[19:03:55] <wikibugs>	 (03PS2) 10Andrew Bogott: Pin python-pbr to an old version for Zuul [puppet] - 10https://gerrit.wikimedia.org/r/328555 (https://phabricator.wikimedia.org/T153877)
[19:04:38] <AndyRussG>	 Hi operations! Quick question: what kind of bad things could happen to a network that we could measure? Besides slowness? Maybe dropped connections?
[19:04:52] <AndyRussG>	 Any way to measure the rate of those by geographic region?
[19:05:13] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[19:05:59] <wikibugs>	 (03PS3) 10Andrew Bogott: Pin python-pbr to an old version for Zuul [puppet] - 10https://gerrit.wikimedia.org/r/328555 (https://phabricator.wikimedia.org/T153877)
[19:06:08] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] Pin python-pbr to an old version for Zuul [puppet] - 10https://gerrit.wikimedia.org/r/328555 (https://phabricator.wikimedia.org/T153877) (owner: 10Andrew Bogott)
[19:06:13] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3979921 keys, up 51 days 10 hours - replication_delay is 43
[19:06:59] <AndyRussG>	 bblack: ema: ^ ? (apologies for bugging u so often 8p)
[19:07:46] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: kubernetes::master: Introduce the kubernetes profile [puppet] - 10https://gerrit.wikimedia.org/r/328174
[19:07:48] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: Create and assign the kubernetes::master role [puppet] - 10https://gerrit.wikimedia.org/r/328175
[19:07:50] <wikibugs>	 (03PS18) 10Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212
[19:07:52] <wikibugs>	 (03PS17) 10Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/324213
[19:08:01] <mutante>	 AndyRussG: fwiw, i know we have probes that are part of https://atlas.ripe.net/
[19:08:30] <mutante>	 so that kind of exists already
[19:09:06] <wikibugs>	 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2739503 (10Deskana) @gehel Has everything gone as planned? I assume silence on this ticket is good news. :-)
[19:11:22] <wikibugs>	 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2894219 (10Gehel) Silence is a good thing! But traffic has left codfw again, and not long after the firmware upgrade by @Papaul.  So it works,...
[19:12:20] <wikibugs>	 (03CR) 10Dzahn: [C: 032] install/dhcp: switch all "next-server" from carbon to install1001 [puppet] - 10https://gerrit.wikimedia.org/r/328439 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn)
[19:12:26] <AndyRussG>	 mutante: cool, thanks!
[19:12:26] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] Pin python-pbr to an old version for Zuul [puppet] - 10https://gerrit.wikimedia.org/r/328555 (https://phabricator.wikimedia.org/T153877) (owner: 10Andrew Bogott)
[19:12:35] <wikibugs>	 (03PS4) 10Andrew Bogott: Pin python-pbr to an old version for Zuul [puppet] - 10https://gerrit.wikimedia.org/r/328555 (https://phabricator.wikimedia.org/T153877)
[19:13:53] <mutante>	 !log carbon - re-enabled puppet and DHCP
[19:13:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:22] <mutante>	 !log public1-b-eqiad and public1-c-eqiad are configured to use install1001 as DHCP, all others still use carbon as DHCP | all subnets now use install1001 as TFTP
[19:15:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:08] <wikibugs>	 06Operations, 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 07Zuul: Zuul has started failing on some repo's in gerrit.wikimedia.org - https://phabricator.wikimedia.org/T153877#2894226 (10hashar) Cherry picked https://gerrit.wikimedia.org/r/328555 on the CI...
[19:16:16] <wikibugs>	 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2894227 (10Deskana) >>! In T149006#2894219, @Gehel wrote: > Silence is a good thing! But traffic has left codfw again, and not long after the f...
[19:16:36] <wikibugs>	 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2739503 (10Deskana) 05Open>03Resolved
[19:17:04] <wikibugs>	 06Operations, 10Annual-Report: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#2894232 (10Heather) That looks good. Thanks, everyone!
[19:17:54] <mutante>	 is relieved
[19:18:01] <mutante>	 that we are ok with annual.wm/2016 
[19:25:36] <wikibugs>	 06Operations, 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 07Zuul: Zuul has started failing on some repo's in gerrit.wikimedia.org - https://phabricator.wikimedia.org/T153877#2894252 (10hashar) 05Open>03Resolved a:03Andrew Ran puppet on contint1001 /...
[19:27:01] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: fix syntax for mysql recording rules [puppet] - 10https://gerrit.wikimedia.org/r/328557
[19:28:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] prometheus: fix syntax for mysql recording rules [puppet] - 10https://gerrit.wikimedia.org/r/328557 (owner: 10Filippo Giunchedi)
[19:34:43] <icinga-wm>	 PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[19:39:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: "> I may need help to apply this to mysql-aggregated." [puppet] - 10https://gerrit.wikimedia.org/r/328425 (owner: 10Filippo Giunchedi)
[19:52:05] <wikibugs>	 (03PS2) 10Yuvipanda: labsdb: Add delete user functionality to maintain-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/328533
[19:55:21] <wikibugs>	 (03PS3) 10Yuvipanda: labsdb: Add delete user functionality to maintain-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/328533
[19:55:28] <wikibugs>	 (03CR) 10Yuvipanda: [V: 032 C: 032] labsdb: Add delete user functionality to maintain-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/328533 (owner: 10Yuvipanda)
[20:01:43] <icinga-wm>	 PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:02:43] <wikibugs>	 (03PS1) 10Yuvipanda: labsdb: Run maintain-dbusers only on active NFS host [puppet] - 10https://gerrit.wikimedia.org/r/328564
[20:02:43] <icinga-wm>	 RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[20:04:09] <wikibugs>	 (03Abandoned) 10Yuvipanda: base: Use the standard location for puppet ca [puppet] - 10https://gerrit.wikimedia.org/r/257275 (owner: 10Yuvipanda)
[20:04:13] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] labsdb: Run maintain-dbusers only on active NFS host [puppet] - 10https://gerrit.wikimedia.org/r/328564 (owner: 10Yuvipanda)
[20:04:26] <wikibugs>	 (03Abandoned) 10Yuvipanda: labs: Limit who can login via the ssh key lookup tool too [puppet] - 10https://gerrit.wikimedia.org/r/259455 (owner: 10Yuvipanda)
[20:04:37] <wikibugs>	 (03Abandoned) 10Yuvipanda: base: Do not do add host nagios monitoring in labs [puppet] - 10https://gerrit.wikimedia.org/r/262359 (https://phabricator.wikimedia.org/T122757) (owner: 10Yuvipanda)
[20:04:43] <wikibugs>	 (03Abandoned) 10Yuvipanda: toollabs: Point shadow to correct master host [puppet] - 10https://gerrit.wikimedia.org/r/265199 (owner: 10Yuvipanda)
[20:05:26] <wikibugs>	 (03Abandoned) 10Yuvipanda: cache: Add labtestspice.wikimedia.org behind misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/301178 (owner: 10Yuvipanda)
[20:05:31] <wikibugs>	 (03Abandoned) 10Yuvipanda: tools: Do not have static class inherit from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/278431 (https://phabricator.wikimedia.org/T128411) (owner: 10Yuvipanda)
[20:06:49] <wikibugs>	 (03Abandoned) 10Yuvipanda: ssh: Disable 2fa for labs [puppet] - 10https://gerrit.wikimedia.org/r/318981 (https://phabricator.wikimedia.org/T147998) (owner: 10Yuvipanda)
[20:07:28] <wikibugs>	 (03PS5) 10Yuvipanda: labs: Clean out projects that don't exist anymore from mounts [puppet] - 10https://gerrit.wikimedia.org/r/327522
[20:08:32] <wikibugs>	 (03CR) 10Yuvipanda: [V: 032 C: 032] labs: Clean out projects that don't exist anymore from mounts [puppet] - 10https://gerrit.wikimedia.org/r/327522 (owner: 10Yuvipanda)
[20:10:05] <wikibugs>	 (03Abandoned) 10Yuvipanda: [WIP] ldap: Cleanup module [puppet] - 10https://gerrit.wikimedia.org/r/287663 (owner: 10Yuvipanda)
[20:10:45] <wikibugs>	 (03Abandoned) 10Yuvipanda: tools: Use phabricator as source for kubernetes building [puppet] - 10https://gerrit.wikimedia.org/r/303727 (https://phabricator.wikimedia.org/T142448) (owner: 10Yuvipanda)
[20:11:07] <wikibugs>	 (03PS3) 10Yuvipanda: puppetmaster: Cleanup unused vars / crons in labs puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/312317
[20:11:37] <wikibugs>	 (03Abandoned) 10Yuvipanda: puppet: Disable enc by default on trusty for now [puppet] - 10https://gerrit.wikimedia.org/r/311761 (owner: 10Yuvipanda)
[20:18:03] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[20:20:27] <wikibugs>	 (03CR) 10Yuvipanda: [C: 032] Tools: Quote arguments in clush [puppet] - 10https://gerrit.wikimedia.org/r/326380 (owner: 10Tim Landscheidt)
[20:21:09] <yuvipanda>	 ^ was me
[20:21:10] <yuvipanda>	 sorry
[20:22:03] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[20:25:30] <wikibugs>	 (03PS15) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324)
[20:29:43] <icinga-wm>	 RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[20:30:29] <wikibugs>	 (03CR) 10Yuvipanda: "I personally like to stick to bash, but I don't particularly care either way..." [puppet] - 10https://gerrit.wikimedia.org/r/326379 (owner: 10Tim Landscheidt)
[20:30:35] <wikibugs>	 (03PS2) 10Yuvipanda: Tools: Remove bashisms from clush [puppet] - 10https://gerrit.wikimedia.org/r/326379 (owner: 10Tim Landscheidt)
[20:30:41] <wikibugs>	 (03CR) 10Yuvipanda: [V: 032 C: 032] Tools: Remove bashisms from clush [puppet] - 10https://gerrit.wikimedia.org/r/326379 (owner: 10Tim Landscheidt)
[20:30:51] <wikibugs>	 (03PS2) 10Yuvipanda: Tools: Quote arguments in clush [puppet] - 10https://gerrit.wikimedia.org/r/326380 (owner: 10Tim Landscheidt)
[20:30:55] <wikibugs>	 (03CR) 10Yuvipanda: [V: 032 C: 032] Tools: Quote arguments in clush [puppet] - 10https://gerrit.wikimedia.org/r/326380 (owner: 10Tim Landscheidt)
[20:41:03] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[20:41:53] <icinga-wm>	 PROBLEM - puppet last run on ms-be1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:43:03] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[20:45:26] <wikibugs>	 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2894515 (10Dzahn) - added test host europium (https://gerrit.wikimedia.org/r/328547, https://gerrit.wikimedia.org/r/328550, https://gerrit.wikimedia.org/r/328548) - @akosiari...
[20:47:45] <wikibugs>	 (03PS2) 10Yuvipanda: labsdb: Run maintain-dbusers only on active NFS host [puppet] - 10https://gerrit.wikimedia.org/r/328564
[20:48:06] <wikibugs>	 (03PS3) 10Yuvipanda: labsdb: Run maintain-dbusers only on active NFS host [puppet] - 10https://gerrit.wikimedia.org/r/328564
[20:48:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] labsdb: Run maintain-dbusers only on active NFS host [puppet] - 10https://gerrit.wikimedia.org/r/328564 (owner: 10Yuvipanda)
[20:48:59] <wikibugs>	 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2894537 (10Dzahn) other things we need:  - move ganglia aggregator - change ACLs / ferm rules for webproxy -- (it's still  webproxy  1H  IN CNAME    carbon.wikimedia.org. in...
[20:52:15] <wikibugs>	 (03PS4) 10Yuvipanda: labsdb: Run maintain-dbusers only on active NFS host [puppet] - 10https://gerrit.wikimedia.org/r/328564
[20:53:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] labsdb: Run maintain-dbusers only on active NFS host [puppet] - 10https://gerrit.wikimedia.org/r/328564 (owner: 10Yuvipanda)
[20:54:12] <wikibugs>	 (03PS5) 10Yuvipanda: labsdb: Run maintain-dbusers only on active NFS host [puppet] - 10https://gerrit.wikimedia.org/r/328564
[20:57:40] <wikibugs>	 (03PS1) 10Urbanecm: Set sortPrepend for gdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328570 (https://phabricator.wikimedia.org/T153900)
[20:58:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Set sortPrepend for gdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328570 (https://phabricator.wikimedia.org/T153900) (owner: 10Urbanecm)
[20:58:24] <wikibugs>	 (03PS6) 10Yuvipanda: labsdb: Run maintain-dbusers only on active NFS host [puppet] - 10https://gerrit.wikimedia.org/r/328564
[20:58:41] <wikibugs>	 (03CR) 10Yuvipanda: [V: 032 C: 032] labsdb: Run maintain-dbusers only on active NFS host [puppet] - 10https://gerrit.wikimedia.org/r/328564 (owner: 10Yuvipanda)
[20:59:23] <wikibugs>	 (03PS2) 10Urbanecm: Set sortPrepend for gdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328570 (https://phabricator.wikimedia.org/T153900)
[21:01:30] <wikibugs>	 (03Draft1) 10Paladox: Gerrit: Convert from utf8 to utf8mb4 [puppet] - 10https://gerrit.wikimedia.org/r/328571 (https://phabricator.wikimedia.org/T153899)
[21:03:18] <wikibugs>	 (03PS2) 10Paladox: Gerrit: Convert from utf8 to utf8mb4 [puppet] - 10https://gerrit.wikimedia.org/r/328571 (https://phabricator.wikimedia.org/T153899)
[21:05:47] <wikibugs>	 (03PS1) 10Yuvipanda: labsdb: Followup to I3156a406f37dd5344273faf5c770c32eddee0e25 [puppet] - 10https://gerrit.wikimedia.org/r/328572
[21:06:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] labsdb: Followup to I3156a406f37dd5344273faf5c770c32eddee0e25 [puppet] - 10https://gerrit.wikimedia.org/r/328572 (owner: 10Yuvipanda)
[21:07:13] <icinga-wm>	 PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:08:03] <wikibugs>	 (03PS2) 10Yuvipanda: labsdb: Followup to I3156a406f37dd5344273faf5c770c32eddee0e25 [puppet] - 10https://gerrit.wikimedia.org/r/328572
[21:08:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] labsdb: Followup to I3156a406f37dd5344273faf5c770c32eddee0e25 [puppet] - 10https://gerrit.wikimedia.org/r/328572 (owner: 10Yuvipanda)
[21:09:00] <wikibugs>	 (03PS3) 10Yuvipanda: labsdb: Followup to I3156a406f37dd5344273faf5c770c32eddee0e25 [puppet] - 10https://gerrit.wikimedia.org/r/328572
[21:10:03] <icinga-wm>	 RECOVERY - puppet last run on ms-be1016 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[21:11:34] <wikibugs>	 06Operations, 10Wikimedia-Site-requests, 07Chinese-Sites, 07Community-consensus-needed: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991#2894683 (10Dzahn)
[21:12:07] <wikibugs>	 (03CR) 10Yuvipanda: [C: 032] labsdb: Followup to I3156a406f37dd5344273faf5c770c32eddee0e25 [puppet] - 10https://gerrit.wikimedia.org/r/328572 (owner: 10Yuvipanda)
[21:13:21] <wikibugs>	 06Operations, 10Wikimedia-Site-requests, 07Chinese-Sites, 07Community-consensus-needed: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991#2553271 (10Dzahn) re-adding Operations because of "There is still the question to get ops green light for upload.wikimedia.org copy by...
[21:13:59] <wikibugs>	 06Operations, 10Wikimedia-Site-requests, 07Chinese-Sites, 07Community-consensus-needed: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991#2894692 (10Dzahn) p:05Triage>03Normal
[21:14:06] <wikibugs>	 06Operations, 10Wikimedia-Site-requests, 07Chinese-Sites, 07Community-consensus-needed: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991#2553271 (10Dzahn) 05Open>03stalled
[21:14:13] <icinga-wm>	 RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[21:24:55] <wikibugs>	 (03CR) 10Alex Monk: "It may be difficult to find someone who has the necessary privileges, knowledge and interest to do this instead" [puppet] - 10https://gerrit.wikimedia.org/r/287663 (owner: 10Yuvipanda)
[21:25:34] <wikibugs>	 (03CR) 10Yuvipanda: "Yup :(" [puppet] - 10https://gerrit.wikimedia.org/r/287663 (owner: 10Yuvipanda)
[21:39:23] <wikibugs>	 (03CR) 10Andrew Bogott: "I'm sure that these can go, but best to merge this after I'm back from traveling." [puppet] - 10https://gerrit.wikimedia.org/r/318451 (owner: 10Dzahn)
[22:16:44] <wikibugs>	 (03PS1) 10Smalyshev: Add configuration for query endpoint URL [puppet] - 10https://gerrit.wikimedia.org/r/328582 (https://phabricator.wikimedia.org/T153897)
[22:17:28] <wikibugs>	 06Operations, 05Prometheus-metrics-monitoring: Improvements to Ganglia-equivalent Prometheus dashboards - https://phabricator.wikimedia.org/T152791#2894804 (10fgiunchedi) @ArielGlenn indeed the stacked graphs are meant for cluster-wide overviews, would the breakdown per-host be enough in this case for what you...
[22:18:42] <wikibugs>	 (03CR) 10Dzahn: [C: 031] Introduce dbmonitor1001, dbmonitor2001 [puppet] - 10https://gerrit.wikimedia.org/r/328509 (https://phabricator.wikimedia.org/T149557) (owner: 10Alexandros Kosiaris)
[22:39:03] <wikibugs>	 06Operations, 05Prometheus-metrics-monitoring: Improvements to Ganglia-equivalent Prometheus dashboards - https://phabricator.wikimedia.org/T152791#2894847 (10ArielGlenn) >>! In T152791#2894804, @fgiunchedi wrote: > @ArielGlenn indeed the stacked graphs are meant for cluster-wide overviews, would the breakdown...
[22:43:43] <icinga-wm>	 PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:07:53] <icinga-wm>	 PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:11:43] <icinga-wm>	 RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[23:19:41] <wikibugs>	 (03PS1) 10Matanya: remove absented file long gone [puppet] - 10https://gerrit.wikimedia.org/r/328596
[23:31:10] <wikibugs>	 06Operations: Setup europium as locke replacement - https://phabricator.wikimedia.org/T82239#2894958 (10Dzahn)
[23:31:17] <mutante>	 !log europium - re-installing with jessie (T82239)
[23:31:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:31:23] <stashbot>	 T82239: Setup europium as locke replacement - https://phabricator.wikimedia.org/T82239
[23:34:43] <wikibugs>	 06Operations: reclaim europium - https://phabricator.wikimedia.org/T153918#2894991 (10Dzahn)
[23:35:53] <icinga-wm>	 RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[23:36:03] <wikibugs>	 06Operations: reclaim europium - https://phabricator.wikimedia.org/T153918#2894991 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/328547/ https://gerrit.wikimedia.org/r/#/c/328550/ https://gerrit.wikimedia.org/r/#/c/328548/  ----  15:32 < mutante> !log europium - re-installing with jessie (T82239)  ----  https://r...
[23:36:31] <wikibugs>	 06Operations: Setup europium as locke replacement - https://phabricator.wikimedia.org/T82239#898279 (10Dzahn) 05Open>03Resolved created subtask for reclaim
[23:37:03] <wikibugs>	 06Operations, 10ops-eqiad: reclaim europium - https://phabricator.wikimedia.org/T153918#2895016 (10Dzahn)
[23:37:54] <wikibugs>	 06Operations, 10ops-eqiad: reclaim europium - https://phabricator.wikimedia.org/T153918#2895020 (10Dzahn) europium.eqiad.wmnet  - eqiad row C - C7  @ 38
[23:39:13] <icinga-wm>	 PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:45:43] <matanya>	 Reedy: https://phabricator.wikimedia.org/T153920 can you please triage ? 
[23:45:56] <Reedy>	 matanya: It's a dupe
[23:46:04] <matanya>	 oh, sorry
[23:46:21] <Reedy>	 :)
[23:48:04] <mutante>	 !log europium - jessie reinstall done - powered down until reclaim (T153918)
[23:48:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:48:07] <stashbot>	 T153918: reclaim europium - https://phabricator.wikimedia.org/T153918
[23:49:18] <wikibugs>	 06Operations, 10ops-eqiad: reclaim europium - https://phabricator.wikimedia.org/T153918#2895108 (10Dzahn) a:05RobH>03None
[23:50:43] <icinga-wm>	 PROBLEM - puppet last run on db1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:51:44] <wikibugs>	 (03PS8) 10Andrew Bogott: Keystone:  Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/328400 (https://phabricator.wikimedia.org/T150774)
[23:55:23] <wikibugs>	 (03PS1) 10Dzahn: openstack: switch tftp server from carbon to install1001 [puppet] - 10https://gerrit.wikimedia.org/r/328597 (https://phabricator.wikimedia.org/T123733)