[00:42:37] Hey...
[00:42:58] Anyone around that has DB access?
[00:44:00] https://quarry.wmflabs.org/query/14916 <- the 35 entries that have ‘transcode_time_addjob’ dates from 2013 are ‘unfixable’.
[00:45:06] PROBLEM - puppet last run on mw1266 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:45:14] They are from videos that were ‘renamed’ while transcodes were on the queue, and appear to be the result of a long-fixed bug.
[00:46:40] transcode ids, in that report, of less than 127000… they can usefully simply ‘go away’
[00:47:27] It can be verified, they are entries for files that no longer exist under that filename.
[01:05:06] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[01:13:06] RECOVERY - puppet last run on mw1266 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[01:31:56] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[01:51:26] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:20:26] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[02:21:05] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.7) (duration: 07m 46s)
[02:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:27] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Jan 15 02:25:27 UTC 2017 (duration 4m 23s)
[02:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:40:04] Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, and 3 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2940696 (SamanthaNguyen)
[02:44:03] (CR) Tim Landscheidt: "With T92813 resolved, is this patch still needed?" [puppet] - https://gerrit.wikimedia.org/r/257861 (https://phabricator.wikimedia.org/T92813) (owner: Filippo Giunchedi)
[02:44:44] (CR) Tim Landscheidt: "Same here: With T92813 resolved, is this patch still needed?" [puppet] - https://gerrit.wikimedia.org/r/257860 (https://phabricator.wikimedia.org/T92813) (owner: Filippo Giunchedi)
[03:23:06] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 789.03 seconds
[03:27:06] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 213.03 seconds
[03:33:06] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:55:50] (PS1) Juniorsys: authdns: Add trailing comma [puppet] - https://gerrit.wikimedia.org/r/332093 (https://phabricator.wikimedia.org/T93645)
[03:58:34] (PS1) Juniorsys: bacula module: Trailing commas, full class names [puppet] - https://gerrit.wikimedia.org/r/332094 (https://phabricator.wikimedia.org/T93645)
[04:00:48] (PS1) Juniorsys: conftool module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332095 (https://phabricator.wikimedia.org/T93645)
[04:01:06] RECOVERY - puppet last run on mw1252 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[04:01:48] (PS1) Juniorsys: contint module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332096 (https://phabricator.wikimedia.org/T93645)
[04:02:47] (PS1) Juniorsys: dataset module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332097 (https://phabricator.wikimedia.org/T93645)
[04:04:07] (PS1) Juniorsys: diamond module: Add trailing commas [puppet] - https://gerrit.wikimedia.org/r/332098 (https://phabricator.wikimedia.org/T93645)
[04:05:10] (PS1) Juniorsys: druid module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332099 (https://phabricator.wikimedia.org/T93645)
[04:06:11] (CR) Jcrespo: [C: -1] "dbstore2001 has started to lag again, we may need to deploy something quick for dbstore2XXX soon." [puppet] - https://gerrit.wikimedia.org/r/328671 (https://phabricator.wikimedia.org/T130128) (owner: Jcrespo)
[04:06:37] (PS1) Juniorsys: ganglia module: Use full names for class names [puppet] - https://gerrit.wikimedia.org/r/332100 (https://phabricator.wikimedia.org/T93645)
[04:08:14] (PS1) Juniorsys: geowiki module: Lint changes + modes/umask quoting [puppet] - https://gerrit.wikimedia.org/r/332101 (https://phabricator.wikimedia.org/T93645)
[04:09:01] (CR) jerkins-bot: [V: -1] geowiki module: Lint changes + modes/umask quoting [puppet] - https://gerrit.wikimedia.org/r/332101 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[04:09:45] (PS1) Juniorsys: install_server module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332102
[04:15:44] (PS1) Juniorsys: mediawiki module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332103 (https://phabricator.wikimedia.org/T93645)
[04:30:09] (PS1) Juniorsys: postgresql module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332104 (https://phabricator.wikimedia.org/T93645)
[05:08:26] (PS1) Juniorsys: puppetmaster module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332105 (https://phabricator.wikimedia.org/T93645)
[05:09:35] (PS1) Juniorsys: role analytics_cluster: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332106 (https://phabricator.wikimedia.org/T93645)
[05:13:49] (PS1) Juniorsys: site.pp - Use full class names, not relative ones [puppet] - https://gerrit.wikimedia.org/r/332107 (https://phabricator.wikimedia.org/T93645)
[05:14:26] Hmm, is there maintenance going on right now?
[05:16:05] https://integration.wikimedia.org came back with Request from XX via cp4003 cp4003, Varnish XID 64063457 Error: 503, Backend fetch failed
[05:21:04] (PS1) Juniorsys: snapshot module: Use full names for class names [puppet] - https://gerrit.wikimedia.org/r/332108 (https://phabricator.wikimedia.org/T93645)
[05:26:16] (PS1) Juniorsys: statistics module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332109 (https://phabricator.wikimedia.org/T93645)
[05:30:52] (PS1) Juniorsys: toollabs role modules: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332110 (https://phabricator.wikimedia.org/T93645)
[05:32:19] (PS1) Juniorsys: toollabs module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332111
[05:33:42] (PS1) Juniorsys: torrus module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332112 (https://phabricator.wikimedia.org/T93645)
[05:34:46] (PS1) Juniorsys: varnish module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332113 (https://phabricator.wikimedia.org/T93645)
[06:29:06] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:36:06] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:47:26] (PS2) Juniorsys: geowiki module: Lint changes + modes/umask quoting [puppet] - https://gerrit.wikimedia.org/r/332101 (https://phabricator.wikimedia.org/T93645)
[07:07:20] (CR) Marostegui: "Yes, I saw the lag yesterday too. I have been watching it during the all hands and it was going from 5h to 1day (yesterday when I checked " [puppet] - https://gerrit.wikimedia.org/r/328671 (https://phabricator.wikimedia.org/T130128) (owner: Jcrespo)
[07:42:46] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:42:46] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:43:36] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[07:43:36] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[08:00:06] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[08:33:06] RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK
[08:41:06] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[08:49:06] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[08:57:06] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK
[09:49:24] Operations, Mail: Do not apply spam headers on email assessed NOT to be spam - https://phabricator.wikimedia.org/T111595#2940814 (Nemo_bis)
[10:04:06] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 25 failures. Last run 2 minutes ago with 25 failures. Failed resources (up to 3 shown): Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service],Service[puppet],Service[rsyslog]
[10:32:06] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[12:07:49] What's the deployment calendar looking like for the coming week? https://wikitech.wikimedia.org/wiki/Deployments hasn't been updated
[12:07:53] greg-g: ^
[12:12:26] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:40:26] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[13:39:26] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:07:26] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[14:08:50] (PS1) Urbanecm: Add one throttle rule + remove obsolete ones [mediawiki-config] - https://gerrit.wikimedia.org/r/332134 (https://phabricator.wikimedia.org/T155345)
[14:26:26] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 640 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2902625 keys, up 76 days 6 hours - replication_delay is 640
[14:35:06] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[14:46:26] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2901894 keys, up 76 days 6 hours - replication_delay is 0
[15:02:53] (CR) Luke081515: [C: 1] Add one throttle rule + remove obsolete ones [mediawiki-config] - https://gerrit.wikimedia.org/r/332134 (https://phabricator.wikimedia.org/T155345) (owner: Urbanecm)
[15:03:06] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[15:35:16] PROBLEM - puppet last run on es1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:48:06] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:04:16] RECOVERY - puppet last run on es1019 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[16:16:06] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[16:23:06] PROBLEM - puppet last run on kafka1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:50:06] RECOVERY - puppet last run on kafka1013 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[16:55:59] Operations, Discovery, Maps, WMF-Legal, Interactive-Sprint: Define tile usage policy - https://phabricator.wikimedia.org/T141815#2941262 (timautin) Great, thanks! So this means that caching (storing tiles as the user browses the map to improve performance) is allowed but not bulk downloading...
[17:44:55] Operations, Traffic, Wikidata, HTTPS: wikiba.se should use HTTPS - https://phabricator.wikimedia.org/T155359#2941317 (abian)
[17:59:45] Operations, media-storage: unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T155323#2940368 (Urbanecm)
[18:00:07] Operations, media-storage: unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T155323#2940368 (Urbanecm)
[18:00:49] Operations, media-storage: unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T155323#2941343 (Urbanecm)
[18:00:56] Operations, media-storage: unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T155323#2940368 (Urbanecm)
[18:01:15] Operations, media-storage: unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T155323#2940368 (Urbanecm)
[18:04:03] Operations, media-storage: unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T155323#2941356 (Urbanecm)
[18:04:41] Operations, media-storage: unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T155323#2940368 (Urbanecm)
[18:04:43] Operations, media-storage: unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T155323#2940368 (Urbanecm)
[18:05:52] Operations, media-storage: unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T155323#2940368 (Urbanecm)
[18:06:40] Operations, media-storage: unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T155323#2941368 (Urbanecm) Open>stalled
[18:49:16] PROBLEM - puppet last run on restbase1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:52:11] Operations, MediaWiki-API, Traffic: Varnish does not cache Action API responses when logged in - https://phabricator.wikimedia.org/T155314#2941430 (Anomie) Note it doesn't really depend on "logged in", just on the presence of anything matching `([sS]ession|Token)=` in the Cookie header: ``` $ curl -i...
[19:04:42] Operations, media-storage: unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T155323#2941433 (zhuyifei1999) @Urbanecm Can I ask what is this task stalled on?
[19:15:28] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdj1]
[19:15:28] PROBLEM - MegaRAID on ms-be2003 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[19:15:39] ACKNOWLEDGEMENT - MegaRAID on ms-be2003 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T155363
[19:15:44] Operations, ops-codfw: Degraded RAID on ms-be2003 - https://phabricator.wikimedia.org/T155363#2941454 (ops-monitoring-bot)
[19:18:19] RECOVERY - puppet last run on restbase1009 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[19:25:00] Operations, media-storage: unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T155323#2941458 (Urbanecm) stalled>Open On nothing. I marked it as stalled by mistake. Sorry!
[19:25:42] Operations, media-storage: unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T155323#2941469 (Urbanecm) But I think this task is at least Normal priority if not high.
[19:30:54] Operations, Traffic, Wikidata, HTTPS: wikiba.se should use HTTPS - https://phabricator.wikimedia.org/T155359#2941299 (Bugreporter) Probably we should fix {T99531} first
[19:39:02] Operations, Traffic, Wikidata, HTTPS: wikiba.se should use HTTPS - https://phabricator.wikimedia.org/T155359#2941299 (Lydia_Pintscher) Yes, agreed.
[19:39:14] Operations, Traffic, Wikidata, HTTPS: wikiba.se should use HTTPS - https://phabricator.wikimedia.org/T155359#2941481 (Lydia_Pintscher) p:Triage>Low
[20:48:59] PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:58:53] (CR) Eevans: [C: 1] cassandra: add jmx_exporter to Cassandra in deployment-prep [puppet] - https://gerrit.wikimedia.org/r/331911 (https://phabricator.wikimedia.org/T155120) (owner: Filippo Giunchedi)
[20:59:03] (PS2) Eevans: cassandra: add jmx_exporter to Cassandra in deployment-prep [puppet] - https://gerrit.wikimedia.org/r/331911 (https://phabricator.wikimedia.org/T155120) (owner: Filippo Giunchedi)
[21:17:59] RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:34:53] Operations, Analytics, ChangeProp, Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2941762 (mobrovac) @MoritzMuehlenhoff can you upload the new node version to our APT? We'd like to move on this issue this week.
[21:35:23] (CR) Gergő Tisza: Reinstate "Remove MWVersion, fold its two functions into MWMultiVersion" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/331552 (owner: Reedy)
[22:26:09] PROBLEM - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:55:09] RECOVERY - puppet last run on ms-be1022 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[23:29:19] PROBLEM - puppet last run on db1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:58:19] RECOVERY - puppet last run on db1049 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures